Abstract. We study the computational complexity of central analysis problems for One-Counter 
Markov Decision Processes (OC-MDPs), a class of finitely-presented, countable-state MDPs. 
OC-MDPs extend finite-state MDPs with an unbounded counter. The counter can be incremented,
decremented, or not changed during each state transition, and transitions may be enabled
or not depending on both the current state and on whether the counter value is 0 or not.
Some states are "random", from where the next transition is chosen according to a given proba- 
bility distribution, while other states are "controlled", from where the next transition is chosen 
by the controller. Different objectives for the controller give rise to different computational 
problems, aimed at computing optimal achievable objective values and optimal strategies. 
OC-MDPs are in fact equivalent to a controlled extension of (discrete-time) Quasi-Birth-Death 
processes (QBDs), a purely stochastic model heavily studied in queueing theory and applied 
probability. They can thus be viewed as a natural "adversarial" extension of a classic stochastic 
model. They can also be viewed as a natural probabilistic/controlled extension of classic one- 
counter automata. OC-MDPs also subsume (as a very restricted special case) a recently studied 
MDP model called "solvency games" that model a risk-averse gambling scenario. 
Basic computational questions for OC-MDPs include "termination" questions and "limit" questions,
such as the following: does the controller have a strategy to ensure that the counter (which
may, for example, count the number of jobs in the queue) will hit value 0 (the empty queue)
almost surely (a.s.)? Or that the counter will have lim sup value $\infty$, a.s.? Or, that it will hit value
0 in a selected terminal state, a.s.? Or, in case such properties are not satisfied almost surely,
compute their optimal probability over all strategies.

We provide new upper and lower bounds on the complexity of such problems. Specifically, we 
show that several quantitative and almost-sure limit problems can be answered in polynomial 
time, and that almost-sure termination problems (without selection of desired terminal states) 
can also be answered in polynomial time. On the other hand, we show that the almost-sure 
termination problem with selected terminal states is PSPACE-hard and we provide an expo- 
nential time algorithm for this problem. We also characterize classes of strategies that suffice 
for optimality in several of these settings. 

Our upper bounds combine a number of techniques from the theory of MDP reward models, 
the theory of random walks, and a variety of automata-theoretic methods. 



1 Introduction 



Markov Decision Processes (MDPs) are a standard model for stochastic dynamic optimization. They de- 
scribe a system that exhibits both stochastic and controlled behavior. The system begins in some state and 
makes a sequence of state transitions; depending on the state, either the controller gets to choose from among 
possible transitions, or there is a probability distribution over possible transitions.¹ Fixing a strategy for the
controller determines a probability space of (potentially infinite) runs, or trajectories, of the MDP. The con- 
troller's goal is to optimize the (expected) value of some objective function, which may be a function of the 
entire trajectory. Two fundamental computational questions that arise are "what is the optimal value that the
controller can achieve?" and "what strategies achieve this?". For finite-state MDPs, such questions have 
been studied for many objectives and there is a large literature on both the complexity of central questions 
as well as on methods that work well in practice, such as value iteration and policy iteration (see, e.g., [23]). 

Many important stochastic models are, however, not finite-state, but are finitely-presented and describe 
an infinite-state underlying stochastic process. Classic examples include branching processes, birth-death 
processes, and many others. Computational questions for such purely stochastic models have also been stud- 
ied for a long time. A model that is of direct relevance to this paper is the Quasi-Birth-Death process (QBD), 
a generalization of birth-death processes that has been heavily studied in queueing theory and applied proba- 
bility (see, e.g., the books [21, 20, 3, 15]). Intuitively, a QBD describes an unbounded queue, using a counter 
to count the number of jobs in the queue, and such that the queue can be in one of a bounded number of 
distinct "modes" or "states". Stochastic transitions can add or remove jobs from the queue and can also tran- 
sition the queue from one state to another. QBDs are in general studied as continuous-time processes, but 
many of their key analyses (including both steady-state and transient analyses) amount to analysis of their 
underlying embedded discrete-time QBD (see, e.g., [20]). An equivalent way to view discrete-time QBDs 
is as a probabilistic extension of classic one-counter automata (see, e.g., [26]), which extend finite-state
automata with an unbounded counter. The counter can be incremented, decremented, or remain unchanged
during state transitions, and transitions may be enabled or not depending on both the current state and on
whether the counter value is 0 or not. In probabilistic one-counter automata (i.e., QBDs), from every state
the next transition is chosen according to a probability distribution depending on that state. (See [9] for more
information on the relation between QBDs and other models.) 

In this paper we study One-Counter Markov Decision Processes (OC-MDPs), which extend discrete- 
time QBDs with a controller. An OC-MDP has a finite set of states: some states are random, from where the 
next transition is chosen according to a given probability distribution, and other states are controlled, from 
where the next transition is chosen by the controller. Again, transitions can change the state and can also 
change the value of the (unbounded) counter by at most 1. Different objectives for the controller give rise to
different computational problems for OC-MDPs, aimed at optimizing those objectives. 

Motivation for studying OC-MDPs comes from several different directions. Firstly, it is very natural, 
both in queueing theory and in other contexts, to consider an "adversarial" extension of stochastic models 
like QBDs, so that stochastic assumptions can sometimes be replaced by "worst-case" or "best-case" as- 
sumptions. For example, under stochastic assumptions about arrivals, we may wish to know whether there 
exists a "best-case" control of the queue under which the queue will almost surely become empty (such 
questions are of course related to the stability of the queue), or we may ask if we can do this with at least a
given probability. Such questions are similar in spirit to questions asked in the rich literature on "adversarial
queueing theory" (see, e.g., [4]), although this is a somewhat different setting. These considerations lead
naturally to the extension of QBDs with control, and thus to OC-MDPs. Indeed, MDP variants of QBDs
have already been studied in the stochastic modeling literature, see [27, 19]. However, in order to keep their
analyses tractable, these works take the drastic approach of cutting off the value of the counter (i.e., size of
the queue) at some arbitrary finite value N, effectively adding dead-end absorbing states at values higher
than N. This restricts the model to a finite-state "approximation". However, cutting off the counter value
can in fact radically alter the behavior of the model, even for purely probabilistic QBDs (see Appendix C
for simple examples). Thus the existing work in the QBD literature on MDPs does not establish any results
about the computational complexity, or even decidability, of basic analysis problems for general OC-MDPs.

¹ Our focus is on discrete state spaces, and discrete-time MDPs. In some presentations of such MDPs, probabilistic and
controlled transitions are combined into one: each transition entails a controller move followed by a probabilistic move. The two
presentations are equivalent.

OC-MDPs also subsume another recently studied infinite-state MDP model called solvency games [1], 
which amount to a very limited subclass of OC-MDPs. Solvency games model a risk-averse "gambler" (or 
"investor"). The gambler has an initial pot of money, given by a positive integer, n. He/she then has to 
choose repeatedly from among a finite set of possible gambles, each of which has an associated random 
gain/loss given by a finite-support probability distribution over the integers. Berger et al. [1] study the
gambler objective of minimizing the probability of going bankrupt. One can of course study the same basic 
repeated gambling model under a variety of other objectives, and many such objectives have been studied. 
It is not hard to see that all such repeated gambling models constitute special cases of OC-MDPs. The 
counter in an OC-MDP can keep track of the gambler's wealth. Although, by definition, OC-MDPs can only 
increment or decrement the counter by one in each state transition, it is easy to simulate any finite change to
the counter value by using auxiliary states and incrementing or decrementing the counter by one at a time.
Similarly, an OC-MDP can easily simulate any choice among finite-support probability distributions
on the integers, each of which defines the random change to the counter corresponding to a particular gamble. [1]
showed that if the solvency game satisfies several additional restrictive technical conditions, then one can 
characterize the optimal strategies for minimizing the probability of bankruptcy (as a kind of "ultimately 
memoryless" strategy) and compute them using linear programming. They did not however establish any 
results for general, unrestricted, solvency games. They conclude with the following remark: "It is clear that 
our results are at best a sketch of basic elements of a larger theory". We believe OC-MDPs constitute an 
appropriate larger framework within which to study algorithmic questions not just for solvency games, but 
for various more general infinite-state MDP models that employ a counter. In Section 4, Proposition 17, we 
show that all qualitative questions about (unrestricted) solvency games, namely whether the gambler has a 
strategy to not go bankrupt with probability > 0, = 1, = 0, < 1, can be answered in polynomial time. 

Our goal is to study the computational complexity of central analysis problems for OC-MDPs. Key
quantities associated with discrete-time QBDs, which can be used to derive many other useful quantities, 
are "termination probabilities" (also known as their "G matrix"). These are the probabilities that, starting 
from a given state, with counter value 1, we will eventually reach counter value 0 for the first time in some
other given state. The complexity of computing termination probabilities for QBDs is already an intriguing 
problem, and many numerical methods have been devised for it. A recent result in [9] shows that these 
probabilities can be approximated in time polynomial in the size of the QBD, in the unit-cost RAM model of 
computation, using a variant of Newton's method, but that deciding, e.g., whether a termination probability
is $> p$ for a given rational $p \in (0, 1)$ in the standard Turing model is at least as hard as a long-standing open
problem in exact numerical computation, namely the square-root sum problem, which is not even known to 
be in NP nor the polynomial-time hierarchy. (See [9] for more information.) 
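
As a minimal illustration of a termination probability (a sketch under simplifying assumptions, and not the Newton-based method of [9]), consider a QBD with a single state whose counter goes up with probability `up`, goes down with probability `down`, and is unchanged otherwise. The probability $x$ of eventually reaching counter value 0 from counter value 1 is then the least non-negative solution of $x = down + stay \cdot x + up \cdot x^2$. The function name and its parameters are hypothetical; the code approximates $x$ by plain fixed-point iteration starting from 0.

```python
def termination_probability(up, down, iterations=10_000):
    """Approximate the least non-negative solution of x = down + stay*x + up*x**2."""
    stay = 1.0 - up - down
    x = 0.0  # starting from 0, the iterates increase towards the least fixed point
    for _ in range(iterations):
        # down: terminate now; stay: retry from counter 1; up: descend twice
        x = down + stay * x + up * x * x
    return x
```

With `up = 2/3` and `down = 1/3` the iterates approach 1/2, whereas with `up = 1/4` and `down = 3/4` they approach 1.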

We study OC-MDPs under related objectives, in particular, the objective of maximizing termination 
probability, and of maximizing the probability of termination in a particular subset of the states (the latter 
problem is considerably harder, as we shall see). Partly as a stepping stone toward these objectives, but also
for its own intrinsic interest, we also consider OC-MDPs without boundary, meaning where the counter
can take on both positive and negative values, and we study the objective of optimizing the probability
that the lim sup value is $= \infty$ (or, by symmetry, that the lim inf is $= -\infty$). The boundaryless model is
related, in a rather subtle way, to the well-studied model of finite-state MDPs with limiting average reward
objectives (see, e.g., [23]). This connection enables us to exploit recent results for finite-state MDPs ([14]),
and classic facts in the theory of 1-dimensional random walks and sums of i.i.d. random variables, to analyze
the boundaryless case of OC-MDPs. We then use these analyses as crucial building blocks for the analysis of
optimal termination probabilities in the case of OC-MDPs with boundary. Our main results are the following:

1. For boundaryless OC-MDPs, where the objective of the controller is to maximize the probability that
the lim sup (lim inf) of the counter value in the run (the trajectory) is $\infty$ ($-\infty$), the situation is as good as
we could hope. Namely, we show: 

(a) The optimal probability is a rational value that is polynomial-time computable. 

(b) There exist deterministic optimal strategies that are both "counter-oblivious" and memoryless (we 
shall call these CMD strategies), meaning the choice of the next transition depends only on the 
current state and neither on the history, nor on the current counter value. 

Furthermore, such an optimal strategy can be computed in polynomial time. 

2. For OC-MDPs with boundary, where the objective is to maximize the probability that, starting in some 
state and with counter value 1, we eventually terminate (reach counter value 0) in any state, we have: 

(a) In general the optimal (supremum) probability can be an irrational value, and this is so already in
the case of QBDs where there is no controller, see [9]. 

(b) It is decidable in polynomial time whether the optimal probability is 1. 

(c) There is a CMD strategy such that starting from every state with value 1, using that strategy we 
terminate almost surely. 

(Optimal CMD strategies need not exist starting from states where the optimal probability is not 1.) 

3. For OC-MDPs with boundary, where the objective is to maximize the probability that, starting from a 
given state and counter value 1, we terminate in a selected subset of states F (i.e., reach counter value
0 for the first time in one of these selected states), we know the following:

(a) The optimal probabilities can of course again be irrational. 

(b) There need not exist any optimal strategy, even when the supremum probability of termination in 
selected states is 1 (i.e., only $\varepsilon$-optimal strategies may exist).

(c) Even deciding whether there is an optimal strategy which ensures probability 1 termination in the
selected states is PSPACE-hard. 

(d) We provide an exponential time algorithm to determine whether there is a strategy using which the 
probability of termination in the selected states is 1, starting at a given state and counter value. 

Our proofs employ techniques from several areas: from the theory of finite-state MDP reward models 
(including some recent results), from the theory of 1-dimensional random walks and sums of i.i.d. random 
variables, and a variety of automata-theoretic methods (e.g., pumping arguments, decomposition arguments, 
etc.). Our results leave open many fascinating questions about OC-MDPs. For example, we do not know 
whether the following problem is decidable: given an OC-MDP and a rational probability $p \in (0, 1)$, decide
whether the optimal probability of termination (in any state) is $> p$. Other open questions pertain to OC-MDPs
where the objective is to minimize termination probabilities. We view this paper as laying the basic
foundations for the algorithmic analysis of OC-MDPs, and we feel that answering some of the remaining 
open questions will likely reveal an even richer underlying theory. 
Related work. A more general MDP model that strictly subsumes OC-MDPs, called Recursive Markov
Decision Processes (RMDPs) was studied in [10, 11]. These are equivalent to MDPs whose state transition 
structure is that of a general pushdown automaton. Problems such as deciding whether there is a strategy 
that yields termination probability 1, or even approximating the maximum probability within any non-trivial 
additive factor, were shown to be undecidable for general RMDPs in [10]. For the restricted class of 1-exit 
RMDPs (which correspond in a precise sense to MDP versions of multi-type branching processes, stochastic 
context-free grammars, and a related model called pBPAs), [10] showed quantitative problems for optimal 
termination probability are decidable in PSPACE, and [11] showed that deciding whether the optimal ter- 
mination probability is 1 can be done in P-time. In [5] this was extended further to answer qualitative 
almost-sure reachability questions for 1-exit RMDPs in P-time. 1-exit RMDPs are however incomparable
with OC-MDPs (which actually correspond to 1-box RMDPs). The references in these cited papers point 
to earlier related literature, in particular on probabilistic Pushdown Systems and Recursive Markov chains. 
There is a substantial literature on numerical algorithms for analysis of QBDs and related purely stochastic 
models (see [21, 20, 3]). In that literature one can find results related to qualitative questions, like whether 
the termination probability for a given QBD is 1. Specifically, it is known that for an irreducible QBD, i.e., 
a QBD in which from every configuration (counter value and state) one can reach every other configuration 
with non-zero probability, whether the underlying Markov chain is recurrent boils down to steady-state analysis
of induced finite-state chains over states of the QBD, and in particular on whether the expected one-step
change in the counter value in steady state is $\le 0$ (see, e.g., Chapter 7 of [20] for a proof). However, these
results crucially assume the QBD is irreducible. They do not directly yield an algorithm for deciding, for
general QBDs, whether the probability of termination is 1 starting from a given state and counter value 1.
Thus, our results for OC-MDPs yield new results even for purely stochastic QBDs without controller. 

2 Basic definitions 

We use $\mathbb{Z}$, $\mathbb{N}$, $\mathbb{N}_0$ to denote the integers, positive integers, and non-negative integers, respectively. We use
standard notation for intervals, e.g., $(0, 1]$ denotes $\{x \in \mathbb{R} \mid 0 < x \le 1\}$. The set of finite words over an
alphabet $\Sigma$ is denoted $\Sigma^*$, and the set of infinite words over $\Sigma$ is denoted $\Sigma^\omega$. $\Sigma^+$ denotes $\Sigma^* \setminus \{\varepsilon\}$ where $\varepsilon$
is the empty word. The length of a given $w \in \Sigma^* \cup \Sigma^\omega$ is denoted $len(w)$, where the length of an infinite
word is $\infty$. Given a word $w$ (finite or infinite) over $\Sigma$, the individual letters of $w$ are denoted $w(0), w(1), \ldots$ (so
indexing begins at 0). For a word $w$, we denote by $w{\downarrow}n$ the prefix $w(0) \cdots w(n-1)$ of $w$. Let $\mathcal{V} = (V, \to)$
where $V$ is a non-empty set and ${\to} \subseteq V \times V$ a total relation (i.e., for every $v \in V$ there is some $u \in V$ such
that $v \to u$). The reflexive transitive closure of $\to$ is denoted $\to^*$. A path in $\mathcal{V}$ is a finite or infinite word
$w \in V^+ \cup V^\omega$ such that $w(i-1) \to w(i)$ for every $1 \le i < len(w)$. A run in $\mathcal{V}$ is an infinite path in $\mathcal{V}$. The set of
all runs in $\mathcal{V}$ is denoted $Run_{\mathcal{V}}$. The set of runs in $\mathcal{V}$ that start with a given finite path $w$ is denoted $Run_{\mathcal{V}}(w)$.

We assume familiarity with basic notions of probability, e.g., a $\sigma$-field, $\mathcal{F}$, over a set $\Omega$, and a probability
measure $\mathcal{P} : \mathcal{F} \to [0, 1]$, together define a probability space $(\Omega, \mathcal{F}, \mathcal{P})$. As usual, a probability distribution
over a finite or countably infinite set $X$ is a function $f : X \to [0, 1]$ such that $\sum_{x \in X} f(x) = 1$. We call $f$
positive if $f(x) > 0$ for every $x \in X$, and rational if $f(x) \in \mathbb{Q}$ for every $x \in X$.

For our purposes, a Markov chain is a triple $\mathcal{M} = (S, \to, Prob)$ where $S$ is a finite or countably infinite
set of states, ${\to} \subseteq S \times S$ is a total transition relation, and $Prob$ is a function that assigns to each state
$s \in S$ a positive probability distribution over the outgoing transitions of $s$. As usual, we write $s \xrightarrow{x} t$ when
$s \to t$ and $x$ is the probability of $s \to t$. To every $s \in S$ we associate the probability space $(Run_{\mathcal{M}}(s), \mathcal{F}, \mathcal{P})$ of
runs starting at $s$, where $\mathcal{F}$ is the $\sigma$-field generated by all basic cylinders, $Run_{\mathcal{M}}(w)$, where $w$ is a finite path
starting with $s$, and $\mathcal{P} : \mathcal{F} \to [0, 1]$ is the unique probability measure such that $\mathcal{P}(Run_{\mathcal{M}}(w)) = \prod_{i=1}^{len(w)-1} x_i$
where $w(i-1) \xrightarrow{x_i} w(i)$ for every $1 \le i < len(w)$. If $len(w) = 1$, we put $\mathcal{P}(Run_{\mathcal{M}}(w)) = 1$.
Definition 1. A Markov decision process (MDP) is a tuple $\mathcal{D} = (V, \hookrightarrow, (V_N, V_P), Prob)$, where $V$ is a
finite or countable set of vertices, ${\hookrightarrow} \subseteq V \times V$ is a total transition relation, $(V_N, V_P)$ is a partition of $V$ into
non-deterministic (or "controlled") and probabilistic vertices, and $Prob$ is a probability assignment which
to each $v \in V_P$ assigns a rational probability distribution on its set of outgoing transitions.

A strategy is a function $\sigma$ which to each $wv \in V^* V_N$ assigns a probability distribution on the set of outgoing
transitions of $v$. We say that a strategy $\sigma$ is memoryless (M) if $\sigma(wv)$ depends only on the last vertex $v$,
and deterministic (D) if $\sigma(wv)$ is a Dirac distribution (assigns probability 1 to some transition) for each
$wv \in V^* V_N$. When $\sigma$ is D, we write $\sigma(wv) = v'$ instead of $\sigma(wv)(v, v') = 1$. For a MD strategy $\sigma$, we
write $\sigma(v) = v'$ instead of $\sigma(wv)(v, v') = 1$. Strategies that are not necessarily memoryless (respectively,
deterministic) are called history-dependent (H) (respectively, randomized (R)). We use HR to denote the set
of all (i.e., H and R) strategies, and we use similar suggestive notation for other strategy classes.

Each strategy $\sigma$ determines a unique Markov chain $\mathcal{D}(\sigma)$ for which $V^+$ is the set of states, and $wu \xrightarrow{x} wuu'$
iff $u \hookrightarrow u'$ and one of the following conditions holds: (1) $u \in V_P$ and $Prob(u, u') = x$, or (2) $u \in V_N$ and $\sigma(wu)$
assigns $x$ to the transition $(u, u')$. To every $w \in Run_{\mathcal{D}(\sigma)}$ we associate the corresponding run $w_{\mathcal{D}} \in Run_{\mathcal{D}}$
where $w_{\mathcal{D}}(i)$ is the vertex currently visited by $w(i)$, i.e., the last element of $w(i)$ (note $w(i) \in V^+$).

For our purposes in this paper, an objective² is a set $O \subseteq Run_{\mathcal{D}}$ (in situations when the underlying MDP
$\mathcal{D}$ is not clear from the context, we write $O_{\mathcal{D}}$ instead of $O$). For every strategy $\sigma$, let $O^{\sigma}$ be the set of all
$w \in Run_{\mathcal{D}(\sigma)}$ such that $w_{\mathcal{D}} \in O$. Further, for every $v \in V$ we use $O^{\sigma}(v)$ to denote the set of all $w \in O^{\sigma}$ which
start at $v$. We say that $O$ is measurable if $O^{\sigma}(v)$ is measurable for all $\sigma$ and $v$. For a measurable objective $O$
and a vertex $v$, the $O$-value in $v$ is defined as follows: $Val^{O}(v) = \sup_{\sigma \in HR} \mathcal{P}(O^{\sigma}(v))$. We say that a strategy
$\sigma$ is $O$-optimal starting at a given vertex $v$ if $\mathcal{P}(O^{\sigma}(v)) = Val^{O}(v)$. We say $\sigma$ is $O$-optimal, if it is optimal
starting at every vertex. An important objective for us is reachability. For every set $T \subseteq V$ of target vertices,
we define the objective $Reach_T = \{w \in Run_{\mathcal{D}} \mid \exists i \in \mathbb{N}_0 \text{ s.t. } w(i) \in T\}$.

Definition 2. A one-counter MDP (OC-MDP) is a tuple, $\mathcal{A} = (Q, \delta^{=0}, \delta^{>0}, (Q_N, Q_P), P^{=0}, P^{>0})$, where

- $Q$ is a finite set of states, partitioned into non-deterministic, $Q_N$, and probabilistic, $Q_P$, states.

- $\delta^{>0} \subseteq Q \times \{-1, 0, 1\} \times Q$ and $\delta^{=0} \subseteq Q \times \{0, 1\} \times Q$ are the sets of positive and zero rules (transitions)
such that each $p \in Q$ has an outgoing positive rule and an outgoing zero rule;

- $P^{>0}$ and $P^{=0}$ are probability assignments; both assign to each $p \in Q_P$, a positive rational probability
distribution over the outgoing transitions in $\delta^{>0}$ and $\delta^{=0}$, respectively, of $p$.

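Each OC-MDP, $\mathcal{A}$, naturally determines an infinite-state MDP with or without a boundary, depending on
whether zero testing is taken into account or not. Formally, we define MDPs $\mathcal{D}^{\to}_{\mathcal{A}}$ and $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$ as follows:

- $\mathcal{D}^{\to}_{\mathcal{A}} = (Q \times \mathbb{N}_0, \mapsto, (Q_N \times \mathbb{N}_0, Q_P \times \mathbb{N}_0), Prob)$. Here for all $p, q \in Q$ and $j \in \mathbb{N}_0$ we have that $p(0) \mapsto q(j)$
iff $(p, j, q) \in \delta^{=0}$. If $p \in Q_P$, then the probability of $p(0) \mapsto q(j)$ is $P^{=0}(p, j, q)$. Further for all $p, q \in Q$,
$i \in \mathbb{N}$, and $j \in \mathbb{N}_0$ we have that $p(i) \mapsto q(j)$ iff $(p, j-i, q) \in \delta^{>0}$. If $p \in Q_P$, then the probability of
$p(i) \mapsto q(j)$ is $P^{>0}(p, j-i, q)$.

- $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}} = (Q \times \mathbb{Z}, \mapsto, (Q_N \times \mathbb{Z}, Q_P \times \mathbb{Z}), Prob)$, where for all $p, q \in Q$ and $i, j \in \mathbb{Z}$ we have that $p(i) \mapsto q(j)$
iff $(p, j-i, q) \in \delta^{>0}$. If $p \in Q_P$, then the probability of $p(i) \mapsto q(j)$ is $P^{>0}(p, j-i, q)$.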

Since the MDPs $\mathcal{D}^{\to}_{\mathcal{A}}$ and $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$ have infinitely many vertices, even MD strategies are not necessarily finitely
representable. But the objectives we consider are often achievable with strategies that use only finite information
about the counter or even ignore the counter value. We call a strategy, $\sigma$, in $\mathcal{D}^{\to}_{\mathcal{A}}$ or $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$,
counter-oblivious-MD (denoted CMD) if there is a selector, $f : Q \to \delta^{>0}$ (which selects a transition out of each state)
so that at any configuration $p(n) \in Q \times \mathbb{N}$, $\sigma$ chooses transition $f(p)$ with prob. 1 (ignoring history and $n$).
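
To make the two semantics concrete, the following sketch encodes only the rules of an OC-MDP (probabilities and the partition of the states are omitted) and lists the one-step successors of a configuration with and without boundary. The class, the function names, and the small two-state example are hypothetical and serve only as an illustration of Definition 2 and of CMD selectors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OCMDPRules:
    positive_rules: frozenset  # triples (p, d, q) with d in {-1, 0, 1}
    zero_rules: frozenset      # triples (p, d, q) with d in {0, 1}


def successors(rules, p, i, boundary=True):
    """One-step successors of configuration p(i), as sorted (state, counter) pairs."""
    if boundary and i < 0:
        raise ValueError("with boundary, the counter is never negative")
    # Zero rules apply only at counter value 0 in the model with boundary.
    active = rules.zero_rules if (boundary and i == 0) else rules.positive_rules
    return sorted((q, i + d) for (src, d, q) in active if src == p)


def cmd_step(selector, p, i):
    """Successor chosen by a CMD strategy: the selected rule depends on p only."""
    src, d, q = selector[p]
    if src != p:
        raise ValueError("selector must pick a rule leaving p")
    return (q, i + d)


# Hypothetical two-state example.
example = OCMDPRules(
    positive_rules=frozenset({("p", -1, "p"), ("p", 1, "q"), ("q", 0, "p")}),
    zero_rules=frozenset({("p", 0, "p"), ("q", 1, "q")}),
)
selector = {"p": ("p", -1, "p"), "q": ("q", 0, "p")}
```

In this example the configuration p(0) has the single successor p(0) with boundary, but the successors p(-1) and q(1) without boundary, since the boundaryless model never consults the zero rules.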
² In general, objectives can be arbitrary Borel measurable functions of trajectories, for which we want to optimize expected value.
We only consider objectives that are characteristic functions of a measurable set of trajectories.
3 OC-MDPs Without Boundary



In this section we study the objective "Cover Negative" (CN), which says that values of the counter during
the run should cover arbitrarily low negative numbers in $\mathbb{Z}$ (i.e., that the lim inf counter value is $= -\infty$). Our
goal is to prove Theorem 4. (All proofs missing in this section can be found in the Appendix.)

Definition 3. Let $\mathcal{A}$ be an OC-MDP. We use $CN_{\mathcal{A}}$ to denote the set of all runs $w \in Run_{\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}}$ such that for every
$n \in \mathbb{Z}$ the run $w$ visits a configuration $p(i)$ for some $p \in Q$ and $i \le n$.

Theorem 4. Given an OC-MDP, $\mathcal{A}$, there is a $CN_{\mathcal{A}}$-optimal CMD strategy for it, which is computable in
polynomial time. Moreover, $Val^{CN_{\mathcal{A}}}$ is rational and computable in polynomial time.

We prove this via a sequence of reductions to problems for finite-state MDPs with and without rewards. For
us an MDP with reward is equipped with $r : V \to \{-1, 0, 1\}$. For $v = v_0 \cdots v_n \in V^+$, let $r(v) := \sum_{i=0}^{n} r(v_i)$.

Definition 5. We denote by $CN$ the set of all $w \in Run_{\mathcal{D}}$ satisfying $\liminf_{n \to \infty} r(w{\downarrow}n) = -\infty$. We further
denote by $MP$ the set of all runs $w \in Run_{\mathcal{D}}$ such that $\lim_{n \to \infty} r(w{\downarrow}n)/n$ exists and $\lim_{n \to \infty} r(w{\downarrow}n)/n \le 0$.³
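
For intuition, the accumulated rewards $r(w{\downarrow}n)$ and the averages $r(w{\downarrow}n)/n$ can be tabulated for a finite prefix of a run. The sketch below uses hypothetical names and a made-up reward function; a finite prefix can of course only illustrate the quantities in Definition 5 and cannot decide membership in CN or MP.

```python
from itertools import accumulate


def prefix_rewards(reward, prefix):
    """Return [r(w|1), ..., r(w|n)] for the finite run prefix w(0)...w(n-1)."""
    return list(accumulate(reward[v] for v in prefix))


def prefix_means(reward, prefix):
    """Return the averages r(w|k)/k for k = 1, ..., n."""
    totals = prefix_rewards(reward, prefix)
    return [total / k for k, total in enumerate(totals, start=1)]


# Hypothetical rewards in {-1, 0, 1} and a made-up run prefix.
reward = {"u": -1, "v": 0, "w": 1}
prefix = ["u", "v", "u", "w", "u"]
```

For this prefix the accumulated rewards are -1, -1, -2, -1, -2, so the last average is -2/5.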

A theorem by Gimbert ([14, Theorem 1]) implies there is always a $CN$-optimal MD strategy for finite MDPs,
because (the characteristic function of) objective $CN$ is prefix-independent and submixing (see Section A.2).
Lemma 7 shows for OC-MDPs there is always a $CN_{\mathcal{A}}$-optimal CMD strategy. We define several problems:

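OC-MDP-CN:

Input: OC-MDP, $\mathcal{A}$, and $z \in \mathbb{Z}$.

Output: a $CN_{\mathcal{A}}$-optimal CMD strategy for $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$ and $Val^{CN_{\mathcal{A}}}(p(z))$, for every $p \in Q$.

MDP-CN:

Input: finite-state MDP, $\mathcal{D}$, with reward function $r$.

Output: a $CN$-optimal MD strategy for $\mathcal{D}$, and $Val^{CN}(v)$, for every vertex $v$ of $\mathcal{D}$.

MDP-CN-qual:

Input: finite-state MDP, $\mathcal{D}$, with reward function $r$.

Output: set $A = \{v \mid Val^{CN}(v) = 1\}$, and a MD strategy $\sigma$ which is $CN$-optimal starting at every $v \in A$.

MDP-MP-qual:

Input: finite-state MDP, $\mathcal{D}$, with reward function $r$.

Output: set $A = \{v \mid \exists \sigma_v \in MD : \mathcal{P}(MP^{\sigma_v}(v)) = 1\}$, and a $\sigma \in MD$ such that $\forall v \in A : \mathcal{P}(MP^{\sigma}(v)) = 1$.⁴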

Proposition 6. 1. There exist the following polynomial-time (Turing) reductions:

OC-MDP-CN $\le_P$ MDP-CN $\le_P$ MDP-CN-qual $\le_P$ MDP-MP-qual

2. The problem MDP-MP-qual can be solved in polynomial time.

The following lemma establishes both the first reduction of Proposition 6, part 1, and the existence of
$CN_{\mathcal{A}}$-optimal CMD strategies for OC-MDPs.

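Lemma 7. Given an OC-MDP, $\mathcal{A}$, there is a finite-state MDP with rewards, $\mathcal{D}$, computable in polynomial
time from $\mathcal{A}$, such that the set of vertices of $\mathcal{D}$ contains $Q$ and for every $p \in Q$, $i \in \mathbb{Z}$ we have that
$Val^{CN_{\mathcal{A}}}(p(i)) = Val^{CN}(p)$. Moreover, for a MD strategy $\sigma$ in $\mathcal{D}$, let $\sigma'$ be the CMD strategy in $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$ with a
selector $f$ defined by $f(p) = \sigma(p)$. Then for each $p(i) \in Q \times \mathbb{Z}$, $\mathcal{P}(CN_{\mathcal{A}}^{\sigma'}(p(i))) = \mathcal{P}(CN^{\sigma}(p))$.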

³ "MP" stands for "(non-positive) Mean Payoff".
⁴ The existence of strategy $\sigma$ is a consequence of the correctness proof in Section A.7.
Procedure Solve-CN($\mathcal{D}$, $r$)

Data: A MDP $\mathcal{D}$ with reward $r$.

Result: Compute the vector $(Val^{CN}(v))_{v \in V}$ and a $CN$-optimal MD strategy, $\sigma$.

1 $(A, \tau) \leftarrow$ Qual-CN($\mathcal{D}$, $r$)

2 $(\sigma_R, (val_v)_{v \in V}) \leftarrow$ Max-Reach($\mathcal{D}$, $A$)

3 for every $v \in V_N$ do if $v \in A$ then $\sigma(v) \leftarrow \tau(v)$ else $\sigma(v) \leftarrow \sigma_R(v)$

4 return $(val_v)_{v \in V}$, $\sigma$
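
Read as ordinary code, the procedure is a thin wrapper around its two subroutines. The sketch below passes them in as hypothetical callables `qual_cn` and `max_reach` (their signatures are assumptions made for this illustration, not part of the procedure) and performs the case split of line 3 on the controlled vertices.

```python
def solve_cn(controlled, qual_cn, max_reach):
    """Combine the results of the two subroutines as in procedure Solve-CN.

    qual_cn()          -> (winning, tau): vertices of value 1, strategy optimal on them
    max_reach(winning) -> (sigma_r, val): strategy and values for reaching `winning`
    """
    winning, tau = qual_cn()            # line 1
    sigma_r, val = max_reach(winning)   # line 2
    # line 3: follow tau inside the winning set, sigma_r everywhere else
    sigma = {v: (tau[v] if v in winning else sigma_r[v]) for v in controlled}
    return val, sigma                   # line 4
```

Passing the subroutines as parameters keeps the sketch independent of how Qual-CN and Max-Reach are actually implemented.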



Dealing with MD strategies simplifies notation. Although the Markov chain $\mathcal{D}(\sigma)$ has infinitely many
states, for a finite MDP $\mathcal{D} = (V, \hookrightarrow, (V_N, V_P), Prob)$ and a MD strategy $\sigma$ we can replace $\mathcal{D}(\sigma)$ with a finite-state
Markov chain $\mathcal{D}\langle\sigma\rangle$ where $V$ is the set of states, and $u \xrightarrow{x} u'$ iff $u \xrightarrow{x} uu'$ in $\mathcal{D}(\sigma)$. This only changes
notation since for every $u \in V$ there is an isomorphism between the probability spaces $Run_{\mathcal{D}(\sigma)}(u)$ and
$Run_{\mathcal{D}\langle\sigma\rangle}(u)$ given by the bijection of runs which maps run $w$ to $w_{\mathcal{D}}$, see the definition of $\mathcal{D}(\sigma)$ in Sect. 2.

To finish the proof of Theorem 4 we have to provide the last two reductions from Proposition 6, part 1,
prove that $Val^{CN}$ is always rational, and prove Proposition 6, part 2. We do these in separate subsections.

3.1 Reduction to Qualitative CN 

Proposition 8. Let $A := \{v \in V \mid Val^{CN}(v) = 1\}$. Then for all $u \in V$ we have:

$Val^{CN}(u) = \max_{\tau \in MD} \mathcal{P}(Reach_A^{\tau}(u)) = \sup_{\tau \in HR} \mathcal{P}(Reach_A^{\tau}(u))$

The reduction MDP-CN $\le_P$ MDP-CN-qual is described in procedure Solve-CN. Its correctness follows
from Proposition 8. Once the set $A$ of vertices with $Val^{CN} = 1$, and a corresponding $CN$-optimal strategy,
are both computed (line 1, which calls the subroutine Qual-CN for solving MDP-CN-qual), solving MDP-CN
amounts to computing an MD strategy for maximizing the probability of reaching a vertex in $A$, and
computing the respective reachability probabilities. This is done on line 2 by calling procedure Max-Reach.
It is well known that Max-Reach can be implemented in polynomial time: both an optimal strategy and the
associated optimal (rational) probabilities can be obtained by solving suitable linear programs (see, e.g., [7]
or [23, Section 7.2.7]). Thus the running time of Solve-CN, excluding the running time of Qual-CN, is
polynomial. Moreover, the optimal values are rational, so Lemma 7 implies that $Val^{CN_{\mathcal{A}}}$ is also rational.
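
The text obtains exact rational values by linear programming. As a lighter illustration, the sketch below swaps in value iteration, which only approximates the optimal reachability probabilities from below and does not by itself give the polynomial-time bound. The dictionary encoding, the function name, and the small example MDP are assumptions made for this illustration.

```python
def max_reach_values(controlled, random, target, sweeps=1000):
    """Approximate the maximal probabilities of reaching `target`.

    controlled[v] lists the successors of a controlled vertex v;
    random[v] maps each successor of a probabilistic vertex v to its probability.
    """
    vertices = set(controlled) | set(random)
    val = {v: (1.0 if v in target else 0.0) for v in vertices}
    for _ in range(sweeps):
        new = {}
        for v in vertices:
            if v in target:
                new[v] = 1.0
            elif v in controlled:
                new[v] = max(val[u] for u in controlled[v])  # controller picks the best successor
            else:
                new[v] = sum(p * val[u] for u, p in random[v].items())
        val = new
    return val


# Hypothetical MDP: from s the controller chooses between two lotteries a and b.
controlled = {"s": ["a", "b"], "goal": ["goal"], "sink": ["sink"]}
random = {"a": {"goal": 0.5, "sink": 0.5}, "b": {"goal": 0.3, "sink": 0.7}}
values = max_reach_values(controlled, random, {"goal"})
```

In this example the controller prefers lottery a, so the value of s is 1/2 and the value of the absorbing vertex sink is 0.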

3.2 Reduction to Qualitative MP 

The reduction MDP-CN-qual $\le_P$ MDP-MP-qual is described in procedure Qual-CN. Fixing some initial
vertex $s$, let us denote by $\Sigma^{MP}$ the set of all MD strategies $\sigma$ satisfying $\mathcal{P}(MP^{\sigma}(s)) = 1$, and by $\Sigma^{CN}$ the set of
all MD strategies $\sigma$ satisfying $\mathcal{P}(CN^{\sigma}(s)) = 1$. It is not hard to see that $\Sigma^{CN} \subseteq \Sigma^{MP}$. If this were an equality,
the reduction would boil down to the identity map. Unfortunately, these sets are not equal in general. A
trivial example is provided by a MDP with just one vertex $s$ with reward 0. More generally, the strategy $\sigma$
may be trapped in a finite loop around 0 (causing $\mathcal{P}(MP^{\sigma}(s)) = 1$) but never accumulate all negative values
(causing $\mathcal{P}(CN^{\sigma}(s)) = 0$). As a solution to this problem, we characterize in Lemma 10 the strategies from
$\Sigma^{MP}$ which are also in $\Sigma^{CN}$, via the property of being "decreasing":

Definition 9. A MD strategy $\sigma$ in $\mathcal{D}$ is decreasing if for every state $u$ of $\mathcal{D}\langle\sigma\rangle$ reachable from $s$ there is a
finite path $w$ initiated in $u$ such that $r(w) = -1$.

Lemma 10. $\Sigma^{CN}$ is the set of all decreasing strategies from $\Sigma^{MP}$.
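
The property in Definition 9 can be explored on small examples with a bounded search. The sketch below works on the finite chain induced by an MD strategy, given as a successor map `succ` and a reward map `reward`; these encodings, the function names, and the length bound `max_len` are assumptions of this illustration. A negative answer only means that no suitable path with at most `max_len` vertices was found.

```python
def reachable(succ, s):
    """States reachable from s along transitions of the chain."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v in succ[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def has_minus_one_path(succ, reward, u, max_len):
    """Is there a path from u with at most max_len vertices and total reward -1?"""
    frontier = {(u, reward[u])}  # (last vertex, accumulated reward) of paths of equal length
    for _ in range(max_len):
        if any(total == -1 for _, total in frontier):
            return True
        frontier = {(v, total + reward[v]) for (x, total) in frontier for v in succ[x]}
    return False


def is_decreasing_up_to(succ, reward, s, max_len):
    """Bounded check of Definition 9 for the chain given by succ and reward."""
    return all(has_minus_one_path(succ, reward, u, max_len) for u in reachable(succ, s))
```

On the one-vertex example with reward 0 the check fails for every bound, which matches the observation that this strategy lies in $\Sigma^{MP}$ but not in $\Sigma^{CN}$.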
Procedure Qual-CN($\mathcal{D}$, $r$)

Data: A MDP $\mathcal{D}$ with reward $r$.

Result: Compute the set $A \subseteq V$ of vertices with $Val^{CN} = 1$, and a MD strategy, $\sigma$, $CN$-optimal starting at every $v \in A$.

1 $\mathcal{D}' \leftarrow$ Decreasing($\mathcal{D}$)

2 $(A', \sigma') \leftarrow$ Qual-MP($\mathcal{D}'$, $r$)

3 $A \leftarrow \{v \in V \mid (v, 1, 0) \in A'\}$

4 $\sigma \leftarrow$ CN-FD-to-MD($\sigma'$)

5 return $(A, \sigma)$



$\mathcal{D}' = (V', \mapsto', (V'_N, V'_P), Prob')$, where

- $V' = \{(u,n,m), [u,n,m,v] \mid u \in V,\ u \mapsto v,\ 0 \le n,m \le |V|^2 + 1\} \cup \{div\}$

- $V'_P = \{[u,n,m,v] \in V' \mid u \in V_P\}$, $V'_N = V' \setminus V'_P$

- the transition relation $\mapsto'$ is the least set satisfying the following for every $u,v \in V$ such that $u \mapsto v$ and $0 \le m,n \le |V|^2 + 1$:

• if $m = |V|^2 + 1$ and $n > 0$, then $(u,n,m) \mapsto' div$

• if $m \le |V|^2 + 1$ and $n = 0$, then $(u,n,m) \mapsto' [u,1,0,v]$

• if $m < |V|^2 + 1$ and $n > 0$, then $(u,n,m) \mapsto' [u,n,m,v]$

• if $u \in V_P$, then $[u,n,m,v] \mapsto' (v, n + r(u), m + 1)$ and $[u,n,m,v'] \mapsto' (v,1,0)$ for all $v' \in V \setminus \{v\}$ such that $[u,n,m,v'] \in V'$

• if $u \in V_N$, then $[u,n,m,v] \mapsto' (v, n + r(u), m + 1)$

• $div \mapsto' div$

- $Prob'([u,n,m,v] \mapsto' (v',n',m')) = Prob(u \mapsto v')$ whenever $[u,n,m,v] \in V'_P$ and $[u,n,m,v] \mapsto' (v',n',m')$. Finally, $r'((u,n,m)) = 0$, $r'([u,n,m,v]) = r(u)$ and $r'(div) = 1$.



A key part of the reduction is the construction of an MDP, $\mathcal{D}'$, described in Figure 1, which simulates the MDP $\mathcal{D}$, but satisfies that $\Sigma^{CN} = \Sigma^{MP}$ for every initial vertex $s$. The idea is to augment the vertices of $\mathcal{D}$ with additional information, keeping track of whether the run under some $\sigma \in \Sigma^{MP}$ "oscillates" with accumulated rewards in a bounded neighborhood of 0, or "makes progress" towards $-\infty$. The last obstacle in the reduction is that MD strategies for $\mathcal{D}'$ do not directly yield MD strategies for $\mathcal{D}$. Rather, a CN-optimal MD strategy, $\tau'$, for $\mathcal{D}'$ induces a deterministic CN-optimal strategy, $\tau$, which uses a finite automaton to evaluate the history of play. Fortunately, given such a strategy $\tau$ it is possible to transform it to a CN-optimal MD strategy for $\mathcal{D}$ by carefully eliminating the memory it uses. This is done on line 4. We postpone the proof of these claims to the Appendix, and just note that the construction of $\mathcal{D}'$ by procedure Decreasing on line 1 can clearly be done in polynomial time. Thus, the overall time complexity of the reduction is polynomial.

3.3 Solving Qualitative MP 

For a fixed vertex $s \in V$, for every MD strategy $\sigma$ and reward function $r$, we define a random variable $V[\sigma, r]$ such that for every run $w \in Run_{\mathcal{D}(\sigma)}(s)$:

\[ V[\sigma, r](w) = \begin{cases} \lim_{n \to \infty} \frac{r(w{\downarrow}n)}{n} & \text{if the limit exists;} \\ \bot & \text{otherwise.} \end{cases} \]

It follows from, e.g., [22, Theorem 1.10.2] that since $\sigma$ is MD the value of $V[\sigma, r]$ is almost surely defined. Solving the MP objective amounts to finding a MD strategy $\sigma$ such that $\mathcal{P}(V[\sigma, r] \le 0)$ is maximal among all MD strategies. We use the procedure get-MD-min to find for every vertex $s \in V$ and a reward function $r$ a MD strategy $\varrho$ such that $EV[\varrho, r] = \min_{\sigma \in MD} EV[\sigma, r]$. This can be done in polynomial time via linear programming: see, e.g., [13, Algorithm 2.9.1] or [23, Section 9.3].

Fig. 1. Definition of the MDP $\mathcal{D}'$.
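Procedure Qual-MP($\mathcal{D}$, $r$)

Data: A MDP $\mathcal{D}$ with reward $r$.

Result: Compute the set $A \subseteq V$ of vertices with $Val^{MP} = 1$ and a MD strategy $\sigma$ MP-optimal starting in every $v \in A$.

1 $V_? \leftarrow V$, $A \leftarrow \emptyset$, $T \leftarrow \emptyset$, $\hat{r} \leftarrow r$

2 while $V_? \ne \emptyset$ do

3     $s \leftarrow$ Extract($V_?$)

4     if $\exists \varrho : EV[\varrho, \hat{r}] \le 0$ then

5         $\varrho \leftarrow$ get-MD-min($\mathcal{D}$, $\hat{r}$, $s$)

6         $C \leftarrow$ a BSCC $C$ of $\mathcal{D}(\varrho)$ such that $C \cap A = \emptyset$ and $\mathcal{P}(V[\varrho, \hat{r}] \le 0 \mid Reach_C) = 1$

7         $(\tau, (reach_u)_{u \in V}) \leftarrow$ Max-Reach($\mathcal{D}$, $C \cup A$)

8         $A' \leftarrow \{u \in V \mid reach_u = 1\}$

9         for every $u \in V_N$, $v \in V$ do if $(u \in C \wedge v = \varrho(u)) \vee (u \in A' \setminus (C \cup A) \wedge v = \tau(u))$ then $T \leftarrow T \cup \{(u,v)\}$

10         $A \leftarrow A' \cup A$

11         for every $u \in V$ do if $u \in A$ then $\hat{r}(u) \leftarrow 0$

12         if $s \notin A$ then $V_? \leftarrow V_? \cup \{s\}$

13 $\sigma \leftarrow$ MD-from-edges($T$)

14 return $(A, \sigma)$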



The core idea of procedure Qual-MP for solving MDP-MP-qual is this: Whenever $EV[\tau, r] \le 0$ then there is a bottom strongly connected component (BSCC), $C$, of the transition graph of $\mathcal{D}(\tau)$, such that almost all runs $w$ reaching $C$ satisfy $V[\tau, r](w) \le 0$. Since $Val^{MP}(s) = 1$ implies the existence of some $\tau \in \Sigma^{MP}$ such that $EV[\tau, r] \le 0$, Qual-MP solves MDP-MP-qual by successively cutting off the BSCCs just mentioned, while maintaining the invariant $\exists \tau : EV[\tau, r] \le 0$. Details and proofs are in the Appendix.

Extract($S$) removes an arbitrary element of a nonempty set $S$ and returns it, and MD-from-edges($T$) returns an arbitrary MD strategy $\sigma$ satisfying $(u,v) \in T \wedge u \in V_N \Rightarrow \sigma(u) = v$. Both these procedures can clearly be implemented in polynomial time. Thus by the earlier discussion about the complexity of Max-Reach, in Section 3.1, we conclude that Qual-MP runs in polynomial time.
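The two helper procedures are simple enough to sketch directly. The Python below is only an illustration under assumed encodings: the edge set is a set of pairs, and `successors_of_controlled` is a hypothetical dictionary listing the successors of each controlled vertex. It assumes the edge set contains at most one edge per controlled vertex, since otherwise no strategy can satisfy the stated implication.

```python
def extract(pool):
    """Remove an arbitrary element of a nonempty set and return it."""
    return pool.pop()


def md_from_edges(edges, successors_of_controlled):
    """Return a memoryless deterministic strategy consistent with `edges`.

    For every (u, v) in edges with u controlled, the strategy picks v at u.
    Controlled vertices not mentioned in `edges` get an arbitrary successor.
    """
    strategy = {}
    for u, v in edges:
        if u in successors_of_controlled:
            strategy[u] = v
    for u, succs in successors_of_controlled.items():
        # Arbitrary choice: the first listed successor.
        strategy.setdefault(u, succs[0])
    return strategy


# Hypothetical example with two controlled vertices a and d.
edges = {("a", "b")}
successors_of_controlled = {"a": ["c", "b"], "d": ["a"]}
strategy = md_from_edges(edges, successors_of_controlled)
pool = {"x"}
picked = extract(pool)
```

Both helpers run in time linear in the size of their input, which is consistent with the polynomial bound claimed for Qual-MP.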

4 OC-MDPs with Boundary 

Fix an OC-MDP, $\mathcal{A} = (Q, \delta^{=0}, \delta^{>0}, (Q_N, Q_P), P^{=0}, P^{>0})$, and its associated MDP, $\mathcal{D}^{\rightarrow}_{\mathcal{A}}$.

Definition 11 (termination objectives). The (non-selective) termination objective, denoted NT, consists of all runs $w$ of $\mathcal{D}^{\rightarrow}_{\mathcal{A}}$ that eventually hit a configuration with counter value zero. Similarly, for a set $F \subseteq Q$ of final states we define the associated selective termination objective, denoted $ST_F$ (or just ST if $F$ is understood), consisting of all runs of $\mathcal{D}^{\rightarrow}_{\mathcal{A}}$ that hit a configuration of the form $q(0)$ where $q \in F$.

Termination objectives are more complicated than the CN objectives considered in Section 3, and even qualitative problems for them require new insights. We define $ValOne^{NT}$ and $ValOne^{ST}$ to be the sets of all $p(i) \in Q \times \mathbb{N}_0$ such that $Val^{NT}(p(i)) = 1$ and $Val^{ST}(p(i)) = 1$, respectively. We also define their subsets $OptValOne^{NT}$ and $OptValOne^{ST}$ consisting of all $p(i) \in ValOne^{NT}$ and all $p(i) \in ValOne^{ST}$, respectively, such that there is an optimal strategy achieving value 1 starting at $p(i)$. Are the inclusions $OptValOne^{NT} \subseteq ValOne^{NT}$ and $OptValOne^{ST} \subseteq ValOne^{ST}$ proper? It turns out that the two objectives differ in this respect. We begin by stating our results about qualitative NT objectives.

Theorem 12. $ValOne^{NT} = OptValOne^{NT}$. Moreover, given an OC-MDP, $\mathcal{A}$, and a configuration $q(i)$ of $\mathcal{A}$, we can decide in polynomial time whether $q(i) \in ValOne^{NT}$. Furthermore, there is a CMD strategy, $\sigma$, constructible in polynomial time, which is optimal starting at every configuration in $ValOne^{NT} = OptValOne^{NT}$.

Next we turn to ST objectives. First, the inclusion $OptValOne^{ST} \subseteq ValOne^{ST}$ is proper: there may be no optimal strategy for ST even when the value is 1. See Appendix B for an example that establishes this. We provide an exponential time algorithm to decide whether a given configuration $q(i)$ is in $OptValOne^{ST}$, and we show there is a "counter-regular" strategy $\sigma$ constructible in exponential time that is optimal starting at all configurations in $OptValOne^{ST}$. We first introduce the notion of coloring.

Definition 13 (coloring). A coloring is a map $C : Q \times \mathbb{N}_0 \to \{b, w, g, r\}$, where b, w, g, and r are the four different "colors" (black, white, gray, and red). For every $i \in \mathbb{N}_0$, we define the i-th column of $C$ as a map $C_i : Q \to \{b, w, g, r\}$, where $C_i(q) = C(q(i))$.

A coloring can be depicted as an infinite matrix of points (each being black, white, gray, or red) with rows indexed by control states and columns indexed by counter values. We are mainly interested in the coloring, $R$, which represents the set $OptValOne^{ST}$ in the sense that for every $p(i) \in Q \times \mathbb{N}_0$, the value of $R(p(i))$ is either b or w, depending on whether $p(i) \in OptValOne^{ST}$ or not. First, we show $R$ is "ultimately periodic":

Lemma 14. Let $N = 2^{|Q|}$. There is an $\ell$, $1 \le \ell \le N$, such that for $j \ge N$, we have $R_j = R_{j+\ell}$.

Thus the coloring $R$ consists of an "initial rectangle" of width $N + 1$ followed by infinitely many copies of the "periodic rectangle" of width $\ell$ (see Fig. 2 in appendix B). Note that $R_N = R_{N+\ell}$. We show how to compute the initial and periodic rectangles of $R$ by, intuitively, trying out all (exponentially many) candidates for the width $\ell$ and the columns $R_N = R_{N+\ell}$. For each such pair of candidates, the algorithm tries to determine the color of the remaining points in the initial and periodic rectangles, until it either finds an inconsistency with the current candidates, or produces a coloring which is not necessarily the same as $R$, but where all black points are certified by an optimal strategy. Since the algorithm eventually tries also the "real" $\ell$ and $R_N = R_{N+\ell}$, all black points of $R$ are discovered. We note that the polynomial-time algorithm for CN objectives is used as a "black-box" here and applied to various OC-MDPs constructed from $\mathcal{A}$ and the current coloring maintained by the algorithm (see Fig. 3). The many subtleties are discussed in Appendix B.
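Ultimate periodicity means that the infinite coloring has a finite representation: it suffices to store the columns up to one full period past the threshold. The sketch below illustrates the lookup only, not the algorithm that computes the rectangles. The names `representative_column` and `color_at`, and the list-of-dictionaries encoding of columns, are assumptions made for this example; `n` plays the role of the threshold $N$ and `period` the role of $\ell$.

```python
def representative_column(i, n, period):
    """Map a counter value i to a stored column index in [0, n + period).

    Relies on the periodicity property: column j equals column j + period
    for every j >= n.
    """
    if i < n:
        return i
    return n + (i - n) % period


def color_at(columns, n, period, state, i):
    """Look up the color of configuration state(i) in a periodic coloring.

    columns: list of n + period dicts, each mapping control state -> color.
    """
    return columns[representative_column(i, n, period)][state]


# Hypothetical coloring over a single control state q with n = 4, period = 3.
columns = [{"q": "w"}] * 4 + [{"q": "b"}, {"q": "w"}, {"q": "w"}]
far_color = color_at(columns, 4, 3, "q", 10)   # same column as index 4
next_color = color_at(columns, 4, 3, "q", 11)  # same column as index 5
```

With this representation, membership of a configuration in the represented set can be checked in time independent of the counter value, once the rectangles are known.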

Theorem 15. An automaton recognizing $OptValOne^{ST}$, and a counter-regular strategy, $\sigma$, optimal starting at every configuration in $OptValOne^{ST}$, are both computable in exponential time.

Thus, membership in $OptValOne^{ST}$ is solvable in exponential time. We do not have an analogous result for $ValOne^{ST}$ and leave this as an open problem (the example in appendix B gives a taste of the difficulties).

A straightforward reduction from the emptiness problem for alternating finite automata over a one-letter alphabet, which is PSPACE-hard, see e.g. [17], shows that membership in $OptValOne^{ST}$ is PSPACE-hard.

Further, we show that membership in $ValOne^{ST}$ is hard for the Boolean Hierarchy (BH) over NP, and thus neither in NP nor coNP assuming standard complexity assumptions. The proof technique, based on a number-theoretic encoding, originated in [18] and was used in [16,24].

Theorem 16. Membership in $ValOne^{ST}$ is BH-hard. Membership in $OptValOne^{ST}$ is PSPACE-hard.

As noted in the introduction, for the very special subclass of solvency games [1], all qualitative problems 
are decidable in polynomial time (see Appendix B for formal definitions and proofs): 

Proposition 17. Given a solvency game, it is decidable in polynomial time whether the gambler has a 
strategy to go bankrupt with probability: > 0, =1, = 0, or < 1. 

The cases other than < 1 are either trivial or follow easily from what we have established for OC-MDPs.
For the case < 1, we make use of a lovely theorem on inhomogeneous (controlled) random walks [8].

A Proofs of Section 3

A.1 Proof of Lemma 7

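Lemma 7. Given an OC-MDP, $\mathcal{A}$, there is a finite-state MDP with rewards, $\mathcal{D}$, computable in polynomial time from $\mathcal{A}$ such that the set of vertices of $\mathcal{D}$ contains $Q$ and for every $p \in Q$, $i \in \mathbb{Z}$ we have that $Val^{CN}(p(i)) = Val^{CN}(p)$. Moreover, for a MD strategy $\sigma$ in $\mathcal{D}$, let $\sigma'$ be the CMD strategy in $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$ with a selector $f$ defined by $f(p) = \sigma(p)$. Then for each $p(i) \in Q \times \mathbb{Z}$, $\mathcal{P}(CN^{\sigma'}(p(i))) = \mathcal{P}(CN^{\sigma}(p))$.

Proof. Consider a MDP $\mathcal{D} = (Q \cup \delta^{>0}, \mapsto, (Q_N \cup \delta^{>0}, Q_P), Prob)$ where

\[ \mapsto \; := \{(p, (p,d,q)) \mid (p,d,q) \in \delta^{>0}\} \cup \{((p,d,q), q) \mid (p,d,q) \in \delta^{>0}\} \]

and $Prob(p, (p,d,q)) = P^{>0}(p,d,q)$ for every $p \in Q_P$. Consider a reward function $r : (Q \cup \delta^{>0}) \to \{-1, 0, 1\}$ such that $r(p) = 0$ for $p \in Q$, and $r((p,d,q)) = d$ for $(p,d,q) \in \delta^{>0}$.

Consider $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}} = (Q \times \mathbb{Z}, \mapsto, (Q_N \times \mathbb{Z}, Q_P \times \mathbb{Z}), Prob)$. Let $\Theta$ be a mapping of paths in $\mathcal{D}$ to paths in $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$ defined as follows: Given a finite path $\omega = p_1 (p_1, d_1, p_2) p_2 (p_2, d_2, p_3) \cdots (p_{n-1}, d_{n-1}, p_n) p_n$ in $\mathcal{D}$ we define $\Theta(\omega)$ to be the path $p_1(i)\, p_2(i + d_1) \cdots p_n(i + \sum_{j=1}^{n-1} d_j)$. Observe that the mapping $\Theta$ is one-to-one and onto.

Let $\bar{\sigma}$ be a HR strategy in $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$. We define a strategy $\sigma$ in $\mathcal{D}$ as follows: For every path $\omega = p_1 (p_1, d_1, p_2) p_2 (p_2, d_2, p_3) \cdots (p_{n-1}, d_{n-1}, p_n) p_n$ in $\mathcal{D}$ we have that $\sigma(\omega)$ assigns $x$ to a transition $(p_n, (p_n, d, q))$ iff $\bar{\sigma}(\Theta(\omega))$ assigns $x$ to $(p_n(i + \sum_{j=1}^{n-1} d_j), q(i + \sum_{j=1}^{n-1} d_j + d))$. Let us extend $\Theta$ to runs $w \in Run_{\mathcal{D}(\sigma)}(p)$ by $\Theta(w)(i) = \Theta(w(2i))$. Then $\Theta : Run_{\mathcal{D}(\sigma)}(p) \to Run_{\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}(\bar{\sigma})}(p(i))$ is a bijection and induces an isomorphism of the corresponding probability spaces.^1 Also, $\Theta(CN^{\sigma}(p)) = CN^{\bar{\sigma}}(p(i))$. Thus $\mathcal{P}(CN^{\sigma}(p)) = \mathcal{P}(CN^{\bar{\sigma}}(p(i)))$, and hence $Val^{CN}(p) \ge Val^{CN}(p(i))$ because $\bar{\sigma}$ was arbitrary.

Let $\sigma$ be a HR strategy in $\mathcal{D}$. We define a strategy $\bar{\sigma}$ in $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$ as follows: For every path $\omega' = p_1(i)\, p_2(i + d_1) \cdots p_n(i + \sum_{j=1}^{n-1} d_j)$ in $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$ we have that $\bar{\sigma}(\omega')$ assigns $x$ to $(p_n(i + \sum_{j=1}^{n-1} d_j), q(i + \sum_{j=1}^{n-1} d_j + d))$ iff $\sigma(\Theta^{-1}(\omega'))$ assigns $x$ to $(p_n, (p_n, d, q))$. Similarly as above, $\mathcal{P}(CN^{\sigma}(p)) = \mathcal{P}(\Theta(CN^{\sigma}(p))) = \mathcal{P}(CN^{\bar{\sigma}}(p(i)))$. It follows that $Val^{CN}(p) \le Val^{CN}(p(i))$ because $\sigma$ was arbitrary. This finishes the proof of 1.

For 2., note that if $\sigma$ is a MD strategy, then the strategy $\bar{\sigma}$ defined in the previous paragraph coincides with the strategy $\sigma'$ from the statement of the lemma on paths of $\mathcal{D}^{\leftrightarrow}_{\mathcal{A}}$. However, then $\mathcal{P}(CN^{\sigma}(p)) = \mathcal{P}(CN^{\bar{\sigma}}(p(i))) = \mathcal{P}(CN^{\sigma'}(p(i)))$.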

A.2 Proof of existence of CN-optimal MD strategies 

We prove that the existence of a CN-optimal MD strategy for finite-state MDPs with rewards follows from [14, Theorem 1]. To do so we need to introduce the following notions from [14]. Note that the notions are simplified to achieve an easier formulation but all the arguments can be easily modified to use the original notions.

Let $O \subseteq Run_{\mathcal{D}}$ be a measurable objective. We say that $O$ is positional if there is some MD strategy $\sigma$ such that every $v \in V$ satisfies $\mathcal{P}(O^{\sigma}(v)) = \sup_{\tau \in HR} \mathcal{P}(O^{\tau}(v))$. Moreover $O$ is prefix independent if for every run $w \in Run_{\mathcal{D}}$ and every finite path $w'$ such that $w'w$ is a run we have that $w \in O$ iff $w'w \in O$. Finally, $O$ is submixing if for every infinite sequence of finite paths $u_0, v_0, u_1, v_1, \ldots$ such that $u_0 v_0 u_1 v_1 \cdots$, $u_0 u_1 \cdots$ and $v_0 v_1 \cdots$ are runs the following is true: If $u_0 v_0 u_1 v_1 \cdots \in O$, then $u_0 u_1 \cdots \in O$, or $v_0 v_1 \cdots \in O$. Theorem 1 of [14] implies that every prefix independent submixing objective is positional.^2

^1 I.e. for any $A \subseteq Run_{\mathcal{D}(\sigma)}(p)$ we have that $A$ is measurable iff $\Theta(A)$ is measurable and $\mathcal{P}(A) = \mathcal{P}(\Theta(A))$.

^2 Note that the results of [14] are more general and consider measurable pay-off functions on runs instead of sets of runs. However, if $O$ is prefix-independent and submixing according to the definition given here, then clearly the characteristic function of $O$ is a prefix independent and submixing pay-off function, as defined in [14], and hence the results of [14] apply.

CN is clearly prefix independent. We now prove that it is also submixing. Let $w = u_0 v_0 u_1 v_1 \cdots$ be a run. For $n \in \mathbb{N}$ we denote by $u{\downarrow}n$ the subword of $w{\downarrow}n$ obtained by leaving out all $v_i$-parts. Similarly we denote by $v{\downarrow}n$ the subword of $w{\downarrow}n$ obtained by leaving out all $u_i$-parts. Note that $r(w{\downarrow}n) = r(u{\downarrow}n) + r(v{\downarrow}n)$. Hence, if $w \in CN$, i.e., $\liminf_{n \to \infty} r(w{\downarrow}n) = -\infty$, then clearly either $\liminf_{n \to \infty} r(u{\downarrow}n) = -\infty$, or $\liminf_{n \to \infty} r(v{\downarrow}n) = -\infty$. It follows that either $u_0 u_1 \cdots \in CN$, or $v_0 v_1 \cdots \in CN$, i.e., CN is submixing. We therefore have:

Lemma 18 (cf. [14]). For finite-state MDPs with rewards, there always exists a CN-optimal MD strategy.

A.3 Auxiliary lemma concerning CN objectives and MD strategies

Lemma 19. Let $\sigma$ be a MD strategy in $\mathcal{D}$ and let $C$ be a bottom strongly connected component (BSCC) of $\mathcal{D}(\sigma)$. Given $u \in C$, we define $R_u : Run_{\mathcal{D}(\sigma)}(u) \to \mathbb{R}$ to be a random variable giving the reward accumulated before the run returns to $u$, i.e.,

\[ R_u(w) = \begin{cases} r(w{\downarrow}n) & \text{if } n = \min\{j \ge 1 \mid w(j) = u\} < \infty \\ \infty & \text{otherwise} \end{cases} \]

Then there is $x_C \in \{0, 1\}$ such that for all $u \in C$ we have $\mathcal{P}(CN^{\sigma}(u)) = x_C$. Moreover, $x_C = 1$ iff for some $u \in C$ we have $\mathcal{P}(R_u < 0) > 0$ and $ER_u \le 0$ (here $ER_u$ is the expected value of $R_u$).

Proof. Let us fix $u \in C$. From [22, Theorem 1.10.2] we have that $\mathcal{P}(Reach^{\sigma}_{u}(v)) = 1$ for all $v \in C$. Thus we have $\mathcal{P}(CN^{\sigma}(u)) = \mathcal{P}(CN^{\sigma}(v))$ because CN is prefix independent, moreover $\mathcal{P}(R_u = \infty) = 0$. Hence, it suffices to show that $\mathcal{P}(CN^{\sigma}(u)) \in \{0, 1\}$, and that $\mathcal{P}(CN^{\sigma}(u)) = 1$ iff $\mathcal{P}(R_u < 0) > 0$ and $ER_u \le 0$.

We define sequences of random variables $I_1, I_2, I_3, \ldots$ and $X_1, X_2, \ldots$ as follows: given a run $w \in Run_{\mathcal{D}(\sigma)}(u)$, we define $I_1(w) = 0$, and for all $n \ge 2$ we define $I_n(w)$ to be the least $m > I_{n-1}(w)$ such that $w(m) = u$. We define $X_n(w) = r(w{\downarrow}I_{n+1}(w)) - r(w{\downarrow}I_n(w))$, the reward accumulated between the $n$-th visit to $u$ (inclusive) and the $(n+1)$-th visit to $u$ (non-inclusive). Observe that $X_1 = R_u$ and that the variables $X_1, X_2, \ldots$ are identically distributed and independent. Therefore, the sequence $X_1, X_2, \ldots$ determines a random walk $S_0, S_1, S_2, \ldots$ on $\mathbb{Z}$ where $S_n = \sum_{i=1}^{n} X_i$.

Suppose that $\mathcal{P}(R_u < 0) > 0$ and $ER_u \le 0$. There are two cases depending on whether $\mathcal{P}(R_u > 0) = 0$, or not. First, assume that $\mathcal{P}(R_u > 0) = 0$ and thus also $EX_1 = EX_j < 0$ for all $j$. Then almost all $w \in Run_{\mathcal{D}(\sigma)}(u)$ satisfy the following: $X_i(w) \le 0$ for every $i > 0$, and $X_j(w) < 0$ for infinitely many $j > 0$, as follows from the strong law of large numbers, see e.g. [2, Theorem 22.1], and the fact that $EX_j < 0$. However, then $\mathcal{P}(CN^{\sigma}(u)) = 1$. Now assume that $\mathcal{P}(R_u > 0) > 0$. We may apply, e.g., [6, Theorem 8.3.4] and conclude that almost all $w \in Run_{\mathcal{D}(\sigma)}(u)$ satisfy $\liminf_{n \to \infty} S_n(w) = -\infty$, which implies that $\mathcal{P}(CN^{\sigma}(u)) = 1$.

Now suppose that either $\mathcal{P}(R_u < 0) > 0$, or $ER_u \le 0$ is not satisfied. If $\mathcal{P}(R_u < 0) = 0$, then clearly for all $w \in Run_{\mathcal{D}(\sigma)}(u)$ and for every $n \ge 0$ we have $r(w{\downarrow}n) \ge -|V|$, which implies that $\mathcal{P}(CN^{\sigma}(u)) = 0$. If $\mathcal{P}(R_u < 0) > 0$ but $ER_u > 0$, then using, e.g., [6, Theorem 8.3.4], almost all $w \in Run_{\mathcal{D}(\sigma)}(u)$ satisfy $\lim_{n \to \infty} S_n(w) = \infty$, which implies that $\mathcal{P}(CN^{\sigma}(u)) = 0$.

A.4 Proof of Proposition 8

Proposition 8. Let $A := \{v \in V \mid Val^{CN}(v) = 1\}$. Then for all $u \in V$ we have:

\[ Val^{CN}(u) = \max_{\tau \in MD} \mathcal{P}(Reach^{\tau}_{A}(u)) = \sup_{\tau \in HR} \mathcal{P}(Reach^{\tau}_{A}(u)) \]

Proof. The fact that $\max_{\tau \in MD} \mathcal{P}(Reach^{\tau}_{A}(u)) = \sup_{\tau \in HR} \mathcal{P}(Reach^{\tau}_{A}(u))$ follows from [23, Section 7.2.7], see also [7]. Clearly $\max_{\tau \in MD} \mathcal{P}(Reach^{\tau}_{A}(u)) \le Val^{CN}(u)$. For the opposite direction, let us pick a CN-optimal MD strategy $\sigma$. Consider the Markov chain $\mathcal{D}(\sigma)$ with states $V$. By Lemma 19 (see Section A.3), for every BSCC $C$ of $\mathcal{D}(\sigma)$ there is a number $x_C \in \{0, 1\}$ such that $x_C = \mathcal{P}(CN^{\sigma}(v)) = Val^{CN}(v)$ for all $v \in C$. Let us denote by $\mathcal{C}$ the union of all BSCCs $C$ such that $x_C = 1$. Let $\pi$ be a MD strategy such that $\mathcal{P}(Reach^{\pi}_{A}(u)) = \max_{\tau \in MD} \mathcal{P}(Reach^{\tau}_{A}(u))$. Then $\mathcal{P}(CN^{\sigma}(u)) \le \mathcal{P}(Reach^{\sigma}_{\mathcal{C}}(u))$ because almost all runs of $\mathcal{D}(\sigma)$ eventually reach a BSCC. However, $\mathcal{C} \subseteq A$, and thus

\[ Val^{CN}(u) = \mathcal{P}(CN^{\sigma}(u)) \le \mathcal{P}(Reach^{\sigma}_{\mathcal{C}}(u)) \le \mathcal{P}(Reach^{\sigma}_{A}(u)) \le \max_{\tau \in MD} \mathcal{P}(Reach^{\tau}_{A}(u)) \]



A.5 Proof of Lemma 10 

We fix an arbitrary initial state $s$ and consider the sets of strategies $\Sigma^{CN}$ and $\Sigma^{MP}$ defined with respect to $s$, see Section 3.2. Recall that a MD strategy $\sigma$ in $\mathcal{D}$ is decreasing if for every state $u$ of $\mathcal{D}(\sigma)$ reachable from $s$ there is a finite path $w$ initiated in $u$ such that $r(w) = -1$. We restate and prove Lemma 10 here.

Lemma 10. $\Sigma^{CN}$ is the set of all decreasing strategies from $\Sigma^{MP}$.

Proof. Let $\sigma$ be a MD strategy. Denote by $C$ the union of all BSCCs of $\mathcal{D}(\sigma)$ reachable from $s$. From [22, Theorem 1.10.2] we have that $\mathcal{P}(Reach^{\sigma}_{C}(s)) = 1$. Let $u \in C$. Similarly as in the proof of Lemma 19 (see Section A.3), we define sequences of random variables $I_1, I_2, I_3, \ldots$ and $X_1, X_2, \ldots$ as follows: given a run $w \in Run_{\mathcal{D}(\sigma)}(u)$, we define $I_1(w) = 0$, and for all $n \ge 2$ we define $I_n(w)$ to be the least $m > I_{n-1}(w)$ such that $w(m) = u$. We define $X_n(w) = r(w{\downarrow}I_{n+1}(w)) - r(w{\downarrow}I_n(w))$, the reward accumulated between the $n$-th visit to $u$ (inclusive) and the $(n+1)$-th visit to $u$ (non-inclusive). Observe that $X_1 = R_u$. We define $D_n(w) = I_{n+1}(w) - I_n(w)$. Observe that both $X_1, X_2, \ldots$ and $D_1, D_2, \ldots$ are sequences of identically distributed and independent random variables. Also $EX_1$ is finite, $0 < ED_1 < \infty$, and $X_1 = R_u$ where $R_u$ is the variable defined in Lemma 19. By the strong law of large numbers, for almost all $w \in Run_{\mathcal{D}(\sigma)}(u)$

\[ \lim_{n \to \infty} \frac{r(w{\downarrow}n)}{n} = \lim_{n \to \infty} \frac{\sum_{i=1}^{n} X_i(w)}{\sum_{i=1}^{n} D_i(w)} = \lim_{n \to \infty} \frac{\frac{1}{n} \sum_{i=1}^{n} X_i(w)}{\frac{1}{n} \sum_{i=1}^{n} D_i(w)} = \frac{EX_1}{ED_1} \]

Assume that $\sigma \in \Sigma^{CN}$. Let $u \in C$. We have $\mathcal{P}(CN^{\sigma}(u)) = 1$ because CN is prefix independent and $u$ is reachable from $s$. Then, by Lemma 19, $ER_u \le 0$, and hence $\mathcal{P}(MP^{\sigma}(u)) = 1$ by the above equation. It follows that $\sigma \in \Sigma^{MP}$ because $u$ was an arbitrary state of $C$, almost all runs initiated in $s$ reach $C$, and MP is prefix independent.

Assume that $\sigma \in \Sigma^{MP}$ and that $\sigma$ is decreasing. Let $u \in C$. We have $\mathcal{P}(MP^{\sigma}(u)) = 1$ because MP is prefix independent and $u$ is reachable from $s$. Then, by the above equation, $ER_u \le 0$. Also, $\mathcal{P}(R_u < 0) > 0$ because $\sigma$ is decreasing. Hence, by Lemma 19, $\mathcal{P}(CN^{\sigma}(u)) = 1$. It follows that $\sigma \in \Sigma^{CN}$ because $u$ was an arbitrary state of $C$, almost all runs initiated in $s$ reach $C$, and CN is prefix independent.



A.6 Properties of $\mathcal{D}'$ and the correctness of Qual-CN

Recall the MDP $\mathcal{D}'$ from Figure 1. In this section we prove some of its properties and prove that the procedure Qual-CN from Section 3.2 is correct. Also recall that whenever we use the sets $\Sigma^{CN}$ and $\Sigma^{MP}$ an initial vertex $s$ has to be specified, see the definition of the sets in Section 3.2.

Lemma 20. Let an initial vertex $s \in V$ be fixed and let $\sigma \in \Sigma^{CN}$. For every state $u$ of $\mathcal{D}(\sigma)$ there is a finite path $w$ of length at most $|V|^2 + 1$ initiated in $u$ such that $r(w) = -1$.

Proof. Let $w$ be the shortest path initiated in $u$ such that $r(w) = -1$. Observe that if there are $i < j$ such that $w(i) = w(j)$ and $r(w{\downarrow}i) \le r(w{\downarrow}j)$, then the path is not the shortest one (consider the path $w(0) \cdots w(i) w(j+1) \cdots$). However, then every vertex can occur at most $|V|$ times in $w{\downarrow}(len(w) - 1)$. This gives the $|V|^2 + 1$ upper bound on the length of $w$.
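The short path from Lemma 20 can be found by a breadth-first search over pairs (vertex, accumulated reward). The sketch below is an illustration under assumptions: the graph is given as a hypothetical successor dictionary for the chain induced by a fixed strategy, rewards are attached to vertices and lie in {-1, 0, 1}, the reward of a path sums the rewards of all vertices that are left along it, and length is counted in transitions.

```python
from collections import deque


def shortest_decreasing_path_length(successors, reward, start):
    """Length of a shortest path from `start` with accumulated reward -1.

    Returns None if no such path exists. Since rewards are in {-1, 0, 1},
    the running sum stays nonnegative until the step where it hits -1.
    """
    cap = len(successors) ** 2 + 1  # generous cap on the running sum
    queue = deque([(start, 0, 0)])  # (vertex, accumulated reward, steps)
    seen = {(start, 0)}
    while queue:
        v, acc, steps = queue.popleft()
        new_acc = acc + reward[v]  # reward collected when leaving v
        for nxt in successors[v]:
            if new_acc == -1:
                return steps + 1
            if new_acc <= cap and (nxt, new_acc) not in seen:
                seen.add((nxt, new_acc))
                queue.append((nxt, new_acc, steps + 1))
    return None


# Hypothetical cycle a -> b -> c -> a with rewards +1, -1, -1.
cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
cycle_reward = {"a": 1, "b": -1, "c": -1}
length = shortest_decreasing_path_length(cycle, cycle_reward, "a")

# A zero-reward self-loop never accumulates -1.
stuck = shortest_decreasing_path_length({"x": ["x"]}, {"x": 0}, "x")
```

The search visits each (vertex, sum) pair at most once, so it runs in time polynomial in the number of vertices.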

Before we proceed to formal treatment, we briefly explain the intuition behind the construction of $\mathcal{D}'$. We start with explaining what information is kept in the vertices of $\mathcal{D}'$. In what follows, vertices of the form $(u, 1, 0)$ for some $u \in V$ are called checkpoints.

- First coordinate: the current vertex of D; 

- second coordinate: the number by which the counter has to be decreased to make the sum of rewards 
gained since the last checkpoint negative; 

- third coordinate: the number of steps since the last checkpoint; 

- fourth coordinate, if present: the next vertex of D through which the "short path" from the last check- 
point, see Lemma 20, should continue. 

When the run starts, the first counter in the current vertex is 1 indicating that we wait for the sum of rewards becoming -1, and the counter of steps is set to 0. As the play proceeds, the counters are updated accordingly. Whenever the first counter reaches value zero, the play reaches a checkpoint and the counters are reset to 1 and 0, respectively. Lemma 20 allows us to bound the (nonnegative) counters in vertices of $\mathcal{D}'$ by $|V|^2 + 1$ and use them to make the strategy choose the right successor in transitions of the type $(u, n, m) \mapsto [u, n, m, v]$ so that $v$ is the successor of $u$ on the "short path" from Lemma 20. If the strategy chooses a bad successor, the player gets "punished" in terms of not satisfying the MP objective by entering a special vertex div (for diverge). Indeed, if the counter of the steps overflows with the accumulated reward from the last checkpoint being nonnegative, the play gets stuck in div and the objective MP is not satisfied.

Lemma 21. Let $s \in V$ be arbitrary. The following is true.

1. Every MD strategy $\sigma'$ in $\mathcal{D}'$ satisfying $\mathcal{P}(MP^{\sigma'}((s, 1, 0))) = 1$ is decreasing.

2. For every MD strategy $\sigma$ in $\mathcal{D}$ there is a MD strategy $\sigma'$ in $\mathcal{D}'$ such that $\mathcal{P}(CN^{\sigma'}((s, 1, 0))) = 1$ for every $s \in V$ such that $\mathcal{P}(CN^{\sigma}(s)) = 1$.

Proof (of 1.). First, observe that div is not reachable from $(s, 1, 0)$. Let $(u, n, m)$ be a state of $\mathcal{D}'(\sigma')$ reachable from $(s, 1, 0)$. First, assume that $n > 0$. There is a path $w$ from $(u, n, m)$ to a state of the form $(u', 0, m')$, otherwise div would have been reachable from $(s, 1, 0)$. Let $k$ be such that $w(2k) = (u', 0, m')$ is the last state of $w$. For every $0 \le i \le k$ we denote $(v_i, n_i, m_i) = w(2i)$. Then $n_0 = n > 0$ and $n_i = n + \sum_{j=0}^{i-1} r(v_j)$ for $1 \le i \le k$. It follows that $n + r'(w) = n + \sum_{j=0}^{k-1} r(v_j) = n_k = 0$. This implies $r'(w) < 0$.

Now assume that $n = 0$. Then $(u, n, m) \mapsto [u, 1, 0, v]$ and $[u, 1, 0, v] \mapsto (v, 1 + r(u), 1)$. Denote $w' = (u, n, m)[u, 1, 0, v]$. If $r(u) = -1$, then $r'(w' \cdot (v, 1 + r(u), 1)) = r(u) = -1$ and we are done. If $r(u) \ge 0$, then $1 + r(u) > 0$ and arguing as above we obtain a path $w$ from $(v, 1 + r(u), 1)$ to some $(u', 0, m')$ such that $1 + r(u) + r'(w) = 0$. However, then $1 + r'(w'w) = 1 + r(u) + r'(w) = 0$ and $r'(w'w) = -1$.

Let $[u, n, m, v]$ be a state reachable from $(s, 1, 0)$. Then $n > 0$ and there is a transition $[u, n, m, v] \mapsto (v, n + r(u), m + 1)$. Arguing as above, we obtain that there is a path $w$ from $(v, n + r(u), m + 1)$ to some $(u', 0, m')$ such that $n + r(u) + r'(w) = 0$, which implies that $r'([u, n, m, v] \cdot w) = r(u) + r'(w) \le -1$ and thus $r'([u, n, m, v] \cdot w') = -1$ for some prefix $w'$ of $w$.

Proof (of 2.). For every $v \in V$ and $0 \le m \le |V|^2$ we denote by $P(v, m)$ the set of all paths in $\mathcal{D}(\sigma)$ of length at most $|V|^2 + 1 - m$ initiated in $v$. We denote $val(v, m) = \min\{r(w) \mid w \in P(v, m)\}$ and $val(v, |V|^2 + 1) = 0$. For $m \le |V|^2$ choose $\theta(v, m) \in V$ to be an arbitrary vertex $u$ such that $v \mapsto u$ is a transition of $\mathcal{D}(\sigma)$ and

\[ val(u, m + 1) = \min\{val(u', m + 1) \mid v \mapsto u' \text{ in } \mathcal{D}(\sigma)\} \]

Let us define a strategy $\sigma'$ as follows:

- Let $(u, n, m) \in V'_N$.

• If $m = |V|^2 + 1$ and $n > 0$, we put $\sigma'((u, n, m)) = div$

• If $m \le |V|^2 + 1$ and $n = 0$, we put $\sigma'((u, n, m)) = [u, 1, 0, \theta(u, 0)]$

• If $m < |V|^2 + 1$ and $n > 0$, we put $\sigma'((u, n, m)) = [u, n, m, \theta(u, m)]$

- For every $[u, n, m, v] \in V'_N$, we put $\sigma'([u, n, m, v]) = (v, n + r(u), m + 1)$

Fix an arbitrary $s \in V$ such that $\mathcal{P}(CN^{\sigma}(s)) = 1$. Denote by $R$ the set of all states of the form $(u, n, m)$ reachable from $(s, 1, 0)$. We prove that $n + val(u, m) \le 0$ for all $(u, n, m) \in R$ by induction on $m$. If $m = 0$, then $n = 1$ and Lemma 20 implies $val(u, 0) \le -1$.

Consider $(u, n, m) \in R$ such that $m > 0$. Then $(u', n', m') \mapsto [u', n'', m'', u] \mapsto (u, n, m)$ in $\mathcal{D}'(\sigma')$ for some $(u', n', m') \in R$. Now either $n'' = 1$ and $m'' = 0$, or $n'' = n'$ and $m'' = m'$. First, assume that $n'' = 1$ and $m'' = 0$. Then $n = 1 + r(u')$ and $m = 1$. By Lemma 20, $val(u', 0) \le -1$ and thus by definition of $\sigma'$, $1 + r(u') + val(u, 1) \le 0$. Now assume that $n'' = n'$ and $m'' = m'$. Then $n = n' + r(u')$ and $m = m' + 1$. By induction hypothesis, $n' + val(u', m') \le 0$, and thus by definition of $\sigma'$, $n' + r(u') + val(u, m) \le n' + val(u', m') \le 0$.

This proves that if $(u, n, |V|^2 + 1) \in R$, then $n = 0$. It follows that div is not reachable. Given $[u, n, m, v] \in V'$, we define $\Theta([u, n, m, v]) = u$. Given $w \in Run_{\mathcal{D}'(\sigma')}((s, 1, 0))$, we define a run $\Theta(w) \in Run_{\mathcal{D}(\sigma)}(s)$ by $\Theta(w)(k) = \Theta(w(2k + 1))$. We have that $\Theta$ induces an isomorphism of the probability spaces $Run_{\mathcal{D}'(\sigma')}((s, 1, 0))$ and $Run_{\mathcal{D}(\sigma)}(s)$. Indeed, it follows from the following three facts: First, div is not reachable. Second, if $u \in V_N$ and $[u, n, m, v] \in R$, then $\sigma(u) = v$ and $[u, n, m, v] \mapsto (v, n'', m'') \mapsto [v, n', m', v']$ in $\mathcal{D}'(\sigma')$ for some $n'', m'', n', m', v'$. Third, if $u \in V_P$ and $[u, n, m, v] \in R$, then $[u, n, m, v] \mapsto^{x} (u', n'', m'') \mapsto^{1} [u', n', m', v']$ in $\mathcal{D}'(\sigma')$ for some $n'', m'', n', m', v'$ iff $u \mapsto u'$ is assigned $x$ in $\mathcal{D}$. Also, $w \in CN^{\sigma'}((s, 1, 0))$ iff $\Theta(w) \in CN^{\sigma}(s)$. Thus $\mathcal{P}(CN^{\sigma}(s)) = \mathcal{P}(CN^{\sigma'}((s, 1, 0))) = 1$.

So far we have that, for a fixed initial vertex $s \in V$, if $\Sigma^{CN} \ne \emptyset$ in $\mathcal{D}$ then $\Sigma^{CN} = \Sigma^{MP} \ne \emptyset$ in $\mathcal{D}'$. It remains to prove the other implication. We do this in two steps and we need the following notion:

Definition 22. A deterministic strategy $\sigma$ in $\mathcal{D}$ is said to be finite-memory (FD) if there is a deterministic finite automaton (DFA) $\mathcal{A}$ such that for every $wu \in V^*V$ the value of $\sigma(wu)$ depends only on $u$ and the current state $k$ of $\mathcal{A}$ after reading $w$ (we write $\sigma(u, k)$ instead of $\sigma(wu)$).

Lemma 23. Given a MD strategy $\sigma'$ in $\mathcal{D}'$ there is a FD strategy $\sigma$ in $\mathcal{D}$ computable in polynomial time such that $\mathcal{P}(CN^{\sigma}(s)) = 1$ for every $s \in V$ with $\mathcal{P}(CN^{\sigma'}((s, 1, 0))) = 1$.

Proof. Let us define $\sigma$ as follows: Let $\mathcal{A} = (K, V, \xi, k_{in})$ where

- $K$ consists of all vertices of $V'$ of the form $[u, n, m, v]$.

- $\xi$ is defined as follows: Let $[u, n, m, v] \in K$ and $u' \in V$. If $u \mapsto u'$, then we define $\xi([u, n, m, v], u')$ to be the unique vertex of the form $[u', n', m', v']$ satisfying $[u, n, m, v] \mapsto (u', n'', m'') \mapsto [u', n', m', v']$ in $\mathcal{D}'(\sigma')$ for some $n''$ and $m''$. Otherwise, we define $\xi([u, n, m, v], u')$ to be an arbitrary state of $K$.

- Define $k_{in} = \sigma'((s, 1, 0))$.

For $u \in V_N$, we define $\sigma(u, [u, n, m, v]) = v$. For $u' \ne u$ we define $\sigma(u', [u, n, m, v])$ to be an arbitrary vertex $u''$ such that $u' \mapsto u''$.

The rest is similar to the end of the proof of Lemma 21. Given $[u, n, m, v] \in V'$, we define $\Theta([u, n, m, v]) = u$. Given $w \in Run_{\mathcal{D}'(\sigma')}((s, 1, 0))$, we define a run $\Theta(w) \in Run_{\mathcal{D}(\sigma)}(s)$ by $\Theta(w)(k) = \Theta(w(2k + 1))$. Then $\Theta$ induces an isomorphism of the probability spaces $Run_{\mathcal{D}'(\sigma')}((s, 1, 0))$ and $Run_{\mathcal{D}(\sigma)}(s)$. Indeed, it follows from the following facts: First, div is not reachable from $(s, 1, 0)$ in $\mathcal{D}'(\sigma')$. Second, if $[u, n, m, v] \in V'_N$, then $[u, n, m, v] \mapsto^{1} (v, n'', m'') \mapsto^{1} [v, n', m', v']$ in $\mathcal{D}'(\sigma')$ for some $n'', m'', n', m', v'$. Third, for $[u, n, m, v] \in V'_P$, $[u, n, m, v] \mapsto^{x} (u', n'', m'') \mapsto^{1} [u', n', m', v']$ for some $n'', m'', n', m', v'$ iff the transition $u \mapsto u'$ is assigned $x$ in $\mathcal{D}$. Also, $w \in CN^{\sigma'}((s, 1, 0))$ iff $\Theta(w) \in CN^{\sigma}(s)$. Thus $\mathcal{P}(CN^{\sigma}(s)) = \mathcal{P}(CN^{\sigma'}((s, 1, 0))) = 1$.

Remark 24. Since the DFA in the proof of Lemma 23 effectively simulates the Markov chain $\mathcal{M}$, we will simplify the notation used in the procedure CN-FD-to-MD by identifying the MD strategy for $\mathcal{D}'$ with its associated FD strategy for $\mathcal{D}$.

Lemma 25. Let $\sigma'$ be a FD strategy in $\mathcal{D}$. Then the procedure CN-FD-to-MD computes in time polynomial in the size of the DFA associated with $\sigma'$ a MD strategy $\sigma$ such that $\mathcal{P}(CN^{\sigma}(s)) = 1$ for every $s \in V$ with $\mathcal{P}(CN^{\sigma'}(s)) = 1$.

Proof. Denote by $\mathcal{A}$ the DFA associated with $\sigma'$. Further denote by $K$ the set of its states, by $k$ its initial state, and by $\xi$ its transition function. Recall that the input alphabet of such an automaton is $V$, the set of vertices of the MDP $\mathcal{D}$. We combine $\mathcal{A}$ with $\mathcal{D}$ and $\sigma'$ by means of parallel sequential composition into a finite Markov chain $\mathcal{M}$. More precisely, the set of vertices of $\mathcal{M}$ is the set $V \times K$ and the transitions and probabilities are defined as follows: For $u \in V_N$ and $k \in K$ we put $(u, k) \mapsto^{1} (u', k')$ if and only if $\sigma'(u, k) = u'$ and $k' = \xi(k, u)$. For $u \in V_P$ and $k \in K$ we put $(u, k) \mapsto^{x} (u', k')$ if and only if $u \mapsto u'$ is assigned the probability $x$ and $k' = \xi(k, u)$. Given $(u, k) \in V \times K$, we denote the projection $\pi_1((u, k)) = u$ and define $r'((u, k)) = r(u)$.
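The product construction just described can be sketched as follows. This is a minimal illustration under assumed encodings, not the authors' implementation: `strategy` maps a pair (vertex, memory state) to the chosen successor, `xi` maps a pair (memory state, vertex) to the next memory state, and all names and the example data are hypothetical.

```python
def product_chain(controlled, random_, strategy, xi, memory_states):
    """Build the Markov chain on pairs (vertex, memory state).

    controlled:    iterable of controlled vertices
    random_:       dict vertex -> list of (successor, probability)
    strategy:      dict (vertex, memory) -> successor chosen by the FD strategy
    xi:            dict (memory, vertex) -> next memory state of the DFA
    Returns a dict (vertex, memory) -> list of ((vertex', memory'), probability).
    """
    chain = {}
    for k in memory_states:
        for u in controlled:
            # Controlled vertex: the strategy picks the successor with probability 1.
            chain[(u, k)] = [((strategy[(u, k)], xi[(k, u)]), 1.0)]
        for u, dist in random_.items():
            # Random vertex: keep the original distribution, update the memory.
            chain[(u, k)] = [((v, xi[(k, u)]), p) for v, p in dist]
    return chain


# Hypothetical two-state memory that flips whenever vertex u is read.
controlled = ["u"]
random_ = {"v": [("u", 0.5), ("w", 0.5)], "w": [("w", 1.0)]}
memory_states = [0, 1]
xi = {(k, x): (1 - k if x == "u" else k)
      for k in memory_states for x in ["u", "v", "w"]}
strategy = {("u", 0): "v", ("u", 1): "w"}
chain = product_chain(controlled, random_, strategy, xi, memory_states)
```

The resulting chain has one state per pair, so its size is polynomial in the sizes of the MDP and the DFA.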

The following procedure CN-FD-to-MD computes a sequence of Markov chains $\mathcal{M}_n$, $0 \le n \le |V|$, with state spaces $V \times K$, transitions $\mapsto_n$ and probabilities $Prob_n$. Then it extracts the strategy $\sigma$ from the last $\mathcal{M}_n$, $n = |V|$. For every $0 \le n \le |V|$, let $C_n$ be a union of all BSCCs of $\mathcal{M}_n$ reachable from $(s, k)$. We say that $u \in V$ is ambiguous in $C_n$ if for at least two $k_1, k_2 \in K$, $k_1 \ne k_2$, both $(u, k_1), (u, k_2) \in C_n$. For every $\mathcal{M}_n$ and an initial vertex $(u, k)$ we define a random variable $R_{(u,k)}$ as follows: given a run $w$ we set $S = \{m > 0 \mid \pi_1(w(m)) = u\}$ and put

\[ R_{(u,k)}(w) = \begin{cases} r'(w{\downarrow}m) & S \ne \emptyset,\ m = \min S \\ \bot & S = \emptyset \end{cases} \]

Since every $\mathcal{M}_n$ is finite, $R_{(u,k)}$ is almost surely defined whenever $(u, k)$ lies in a BSCC, and the expectation $ER_{(u,k)}$ is finite, see [22, Theorem 1.10.2].

Procedure CN-FD-to-MD computes $\sigma$; we first estimate its running time. The while-loop on line 2 is executed at most $|V|$ times because every state $(u_a, k)$ picked in step 3 is no longer ambiguous in later iterations. We show that the picking in step 3 takes polynomial time. First, due to, e.g., [12, XV.7], $ER_{(u_a,k)}$ can be expressed as a unique solution of a linear system of equations, computable in polynomial time. So we can compute $ER_{(u_a,k)}$ in polynomial time and check whether $ER_{(u_a,k)} \le 0$. Second, the problem whether $\mathcal{P}(R_{(u_a,k)} < 0) > 0$ is equivalent to the existence of a negative weighted cycle in the BSCC containing $(u_a, k)$, which can be decided in polynomial time using, e.g., the Bellman-Ford algorithm. Time complexity of the procedure Max-Reach on line 9 has already been analyzed in Section 3.1.
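The negative-cycle test mentioned above can be sketched with a standard Bellman-Ford pass. The encoding is an assumption made for illustration: the BSCC is given as a list of vertices and a list of weighted edges (source, target, weight), where the weight of an edge would be the reward collected when leaving its source. Starting all distances at 0 detects a negative cycle anywhere in the graph.

```python
def has_negative_cycle(vertices, edges):
    """Detect a negative-weight cycle with the Bellman-Ford algorithm.

    vertices: list of vertex names
    edges:    list of (source, target, weight) triples
    """
    # Distance 0 everywhere acts like a virtual source linked to every vertex.
    dist = {v: 0 for v in vertices}
    for _ in range(len(vertices) - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False  # distances stabilized, so no negative cycle exists
    # Any edge that can still be relaxed witnesses a negative cycle.
    return any(dist[u] + w < dist[v] for u, v, w in edges)


# Hypothetical two-vertex components.
negative = has_negative_cycle(["a", "b"], [("a", "b", 1), ("b", "a", -2)])
balanced = has_negative_cycle(["a", "b"], [("a", "b", 1), ("b", "a", -1)])
```

The running time is proportional to the number of vertices times the number of edges, which is polynomial as the analysis requires.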

Let us prove correctness. Fix some s e V with P(CN'^ ^(s)) = 1. We prove by induction that for all h, 
<h < \V\: P(CN((s,k))) - I in Mh. For h - this is true by the choice of s. Assume that the statement is 
Procedure CN-FD-to-MD($\sigma'$): computing an MD CN-optimal strategy from an FD one.

Data: The product Markov chain $M$ determined by the strategy $\sigma'$.
Result: A CN-optimal MD strategy $\sigma$.

 1  $n \leftarrow 0$, $M_0 \leftarrow M$
 2  while there are states ambiguous in $C_n$ do
 3      Pick $(u_n, k) \in C_n$ such that $u_n$ is ambiguous in $C_n$, $ER_{(u_n,k)} \le 0$, and $\mathcal{P}(R_{(u_n,k)} < 0) > 0$.
 4      Compute $M_{n+1}$ from $M_n$ as follows:
 5          Set $(v,k') \to_{n+1} (u_n,k)$ iff $(v,k') \to_{n} (u_n,k'')$ for some $k''$.
 6          Set $(v,k) \to_{n+1} (u,k')$ iff $(v,k) \to_{n} (u,k')$ for $u \neq u_n$.
 7      $n \leftarrow n + 1$
 8  $C \leftarrow \{u \in V \mid (u,k) \in C_n\}$
 9  $(\varrho, \cdot) \leftarrow$ Max-Reach($D$, $C$)
10  for $u \in V$ do if $u \notin C$ then $\sigma(u) = \varrho(u)$ else $\sigma(u) = v$ where $\exists k, k' \in K : (u,k) \to_{n} (v,k')$
11  return $\sigma$



true for some $h = n \in \mathbb{N}_0$; we prove it for $h = n + 1$. First, we prove that if there is some $v$ ambiguous in $C_n$,
then there is $(v, k) \in C_n$ such that $ER_{(v,k)} \le 0$ and $\mathcal{P}(R_{(v,k)} < 0) > 0$. Let $C$ be a BSCC of $M_n$ reachable from
$(s, k)$ and containing at least two states from $\{v\} \times K$. Let us denote $C^v := (\{v\} \times K) \cap C = \{(v, k_1), \ldots, (v, k_\ell)\}$.

We define sequences of random variables $I_0, I_1, I_2, \ldots$ and $X^i_1, X^i_2, \ldots$ where $i \in \{1, \ldots, \ell\}$ as follows: Let
$w$ be a run in $M_n$ initiated in some $(v, k_j) \in C^v$. We define $I_0(w) = 0$, and for all $j \ge 1$ we define $I_j(w)$ to be
the least $m > I_{j-1}(w)$ such that $w(m) \in C^v$. Let $i \in \{1, \ldots, \ell\}$ and let $m_1, m_2, m_3, \ldots$ be all indexes such that
$w(I_{m_j}(w)) = (v, k_i)$. We define $X^i_j(w) = r'(w{\downarrow}I_{m_j+1}(w)) - r'(w{\downarrow}I_{m_j}(w))$, the reward accumulated between the $j$-th
visit to $(v, k_i)$ and the next visit to $C^v$.

Consider the Markov chain $M_n$. Observe that $EX^i_j$ is independent of the actual initial vertex $(v, k_j) \in C^v$
and that $EX^i_j = ER_{(v,k_i)}$. Also, for a fixed $i$, $1 \le i \le \ell$, the variables $X^i_j$, $j \ge 1$, are independent and identically
distributed. We claim that $EX^i_1 \le 0$ for some $i$ and $\mathcal{P}(X^i_1 < 0) > 0$. Assume, to the contrary, that there is no
such $i$ and let us denote $B = \{i \mid EX^i_1 > 0\}$. The variables $X^i_j$ generate $\ell$ random walks of the form $S^i_1, S^i_2, \ldots$
by $S^i_n = \sum_{j=1}^{n} X^i_j$. For every $i \in B$ the walk $S^i$ drifts almost surely to $\infty$, by, e.g., [6, Theorem 8.3.4]. On
the other hand, for every $i \in \{1, \ldots, \ell\} \setminus B$ the walk never reaches values smaller than a fixed number. Since
for almost all runs starting in some (v, kj) e C^' we have liminf„^oo r{win) = liminf„^oo ^J,, it follows 
that in $M_n$: $\mathcal{P}(CN((v,k_j))) = 0$, and hence $\mathcal{P}(CN((s,k))) < 1$, a contradiction.

Now we prove that in $M_{n+1}$: $\mathcal{P}(CN((s,k))) = 1$. Assume that $(u_n,k)$ is the state selected in step 3. Then the
expected value $ER_{(u_n,k)}$ is the same in both $M_{n+1}$ and $M_n$ and thus not positive. Further, $\mathcal{P}(R_{(u_n,k)} < 0) > 0$
in $M_{n+1}$ and no states of the form $(u_n, k')$ where $k' \neq k$ are reachable from $(u_n, k)$ in $M_{n+1}$. By Lemma 19,
$\mathcal{P}(CN((u_n,k))) = 1$ in $M_{n+1}$. Let $A$ be the set of all runs of $Run_{M_n}((s,k))$ not reaching $(u_n,k)$. Clearly, the
probability of $A$ is the same in $M_n$ as in $M_{n+1}$. Hence, $\mathcal{P}(CN((s, k))) = 1$ in $M_{n+1}$.

Finally, note that for every $n$ the Markov chain $M_n$ has the following properties:

- For each $(u, k) \in V_P \times K$, if $(u, k) \xrightarrow{x} (v, k')$, then $u \xrightarrow{x} v$.

- For each $(u, k) \in V_N \times K$, if $(u, k) \xrightarrow{x} (v, k')$, then $x = 1$ and $u \to v$.

Hence, in step 8 the set $C$ is reachable from $s$ with probability 1 using a suitable MD strategy $\varrho$, line 9.
Consequently the strategy $\sigma$ for $D$ is well-defined (line 10) and satisfies $\mathcal{P}(CN^{\sigma}(s)) = 1$.

The correctness of the reduction represented by the procedure Qual-CN follows from Lemma 21, 
Lemma 23, Remark 24, and Lemma 25. 
A.7 Correctness of Qual-MP

Denote $W = \{s \in V \mid \exists\ \text{MD strategy } \sigma : \mathcal{P}(MP^{\sigma}(s)) = 1\}$. In this section we prove that the set $A$ and the MD
strategy $\sigma$ computed by the procedure Qual-MP satisfy: $\forall s \in W : \mathcal{P}(MP^{\sigma}(s)) = 1$ and $W = A$.

Choose an arbitrary $s \in V$. Let $\sigma$ be a MD strategy. Let us denote by $BSCC[D(\sigma)]$ the set of all BSCCs
of $D(\sigma)$ reachable from $s$. By standard arguments from the theory of Markov chains (see e.g. [22, Section 1.5]),
$\sum_{C \in BSCC[D(\sigma)]} \mathcal{P}(Reach^{\sigma}_C(s)) = 1$. Recall also the random variable $V[\sigma, r]$ defined in Section 3.3.
In particular recall that [22, Theorem 1.10.2] implies that for almost all runs $w$, $V[\sigma, r](w) = \lim_{n\to\infty} r(w{\downarrow}n)/n$.
Moreover, using [22, Theorem 1.10.2] again, for every $C \in BSCC[D(\sigma)]$ there is a constant $a_C \in \mathbb{R}$ such
that $V[\sigma, r] = a_C$ almost surely on the condition of hitting $C$. Thus for the expected value we have

\[
EV[\sigma, r] = \sum_{C \in BSCC[D(\sigma)]} a_C \cdot \mathcal{P}(Reach^{\sigma}_C(s)).
\]

We prove that there is a MD strategy $\varrho$ computable in polynomial time such that $EV[\varrho, r] = \min_{\sigma} EV[\sigma, r]$,
where the minimum is taken over MD strategies.

Let $\sigma$ be a MD strategy. We define a sequence of random variables $V_1[\sigma, r], V_2[\sigma, r], \ldots$ such that
$V_n[\sigma, r](w) = r(w{\downarrow}n)$ for every run $w \in Run_{D(\sigma)}(s)$ and every $n \ge 1$. Let us denote by $EV_n[\sigma, r]$ the expected
value of $V_n[\sigma, r]$ (i.e. $EV_n[\sigma, r] = \sum_{i} i \cdot \mathcal{P}(V_n[\sigma, r] = i)$).

Note that

\[
\frac{EV_n[\sigma, r]}{n} = \sum_{i=-n}^{n} \frac{i}{n} \cdot \mathcal{P}(V_n[\sigma, r] = i) = \sum_{i=-n}^{n} \frac{i}{n} \cdot \mathcal{P}\left(\frac{V_n[\sigma, r]}{n} = \frac{i}{n}\right) = E\,\frac{V_n[\sigma, r]}{n}
\]

and that $\left|\frac{V_n[\sigma, r]}{n}\right| \le 1$. Hence by the dominated convergence theorem (see e.g. [2, Theorem 16.4])

\[
\lim_{n\to\infty} \frac{EV_n[\sigma, r]}{n} = E \lim_{n\to\infty} \frac{V_n[\sigma, r]}{n} = EV[\sigma, r].
\]

Using either [13, Theorem 2.9.4], or [23, Theorem 9.3.8], and a P-time algorithm for linear programming,
one can construct a polynomial time algorithm which computes a MD strategy $\varrho$ such that (taking the minima
over MD strategies)

\[
EV[\varrho, r] = \lim_{n\to\infty} \frac{EV_n[\varrho, r]}{n} = \min_{\sigma} \lim_{n\to\infty} \frac{EV_n[\sigma, r]}{n} = \min_{\sigma} EV[\sigma, r]
\]

and also computes the value $EV[\varrho, r]$.

In the proof of correctness and the complexity estimates of Qual-MP we will denote by $r_i$, $T_i$, and $A_i$ the
reward represented by $r$, the content of the set $T$, and the set $A$, respectively, before the $i$-th iteration of the
while-loop; in particular $r_0 = r$, $T_0 = \emptyset$, and $A_0 = \emptyset$. We also denote by $\varrho_i$ the strategy $\varrho$ from line 5 computed
in the $i$-th iteration of the while-loop.

Choose some $s \in W$ so that there is a strategy $\bar{\sigma}$ such that $\mathcal{P}(MP^{\bar{\sigma}}(s)) = 1$, i.e., $V[\bar{\sigma}, r](w) \le 0$ almost
surely. Given $i$, we define a MD strategy $\sigma^i$ such that for every $u \in V$

\[
\sigma^i(u) = \begin{cases} v & \text{if } (u, v) \in T_i, \\ \bar{\sigma}(u) & \text{otherwise.} \end{cases}
\]


The algorithm keeps the following invariants:

(a) $V[\sigma^i, r_i](w) \le 0$ and $V[\sigma^i, r](w) \le 0$ almost surely.
(b) For every $u \in V \setminus A_i$ and every strategy $\sigma$ in $D$, the probability of reaching $A_i$ from $u$ is strictly less than
1. There is no path from any state of $A_i$ to $V \setminus A_i$ in $D(\sigma^i)$.

(c) A and V? are disjoint. 

The invariant (c) follows by an easy induction from lines 1 and 12. 

Clearly, the invariant (a) implies that on line 5 the strategy $\varrho$ always exists. We prove that a BSCC $C$
from line 6 exists. Note that by the invariant (b), for all $C \in BSCC[D(\sigma)]$ either $C \cap A_i = \emptyset$, or $C \subseteq A_i$, and
there must be at least one $C$ such that $C \cap A_i = \emptyset$, otherwise $s$ could not have been in $A_i$, contradicting line 3
and the invariant (c). Also there are numbers $a_{C,i}$ for every $C \in BSCC[D(\varrho_i)]$ such that $V[\varrho_i, r_i] = a_{C,i}$ almost
surely on the condition of hitting $C$, and

\[
EV[\varrho_i, r_i] = \sum_{C \in BSCC[D(\varrho_i)]} a_{C,i} \cdot \mathcal{P}(Reach^{\varrho_i}_C(s)).
\]

However, all $C \in BSCC[D(\varrho_i)]$ such that $C \subseteq A_i$ satisfy $a_{C,i} = 0$. Hence, there must be at least one
$C_{hit} \in BSCC[D(\varrho_i)]$ such that $C_{hit} \cap A_i = \emptyset$ and $a_{C_{hit},i} \le 0$.

Now every $D \in BSCC[D(\sigma^{i+1})]$ satisfies either $D = C_{hit} \subseteq A_{i+1} \setminus A_i$, or $D \subseteq (V \setminus A_{i+1}) \cup A_i$ and
$D \in BSCC[D(\sigma^{i})]$. Moreover, transitions between states of $C_{hit}$ in $D(\sigma^{i+1})$ coincide with transitions
between states of $C_{hit}$ in $D(\varrho_i)$. Also, transitions between states of every $D \neq C_{hit}$ in $D(\sigma^{i+1})$ coincide with
transitions between states of $D$ in $D(\sigma^{i})$.

Then almost all $w \in Reach^{\sigma^{i+1}}_{C_{hit}}(s)$ satisfy $V[\sigma^{i+1}, r_{i+1}](w) \le 0$ because $r_{i+1}$ assigns 0 to all states of $C_{hit}$.
Also, almost all $w \in Reach^{\sigma^{i+1}}_{D}(s)$ where $D \neq C_{hit}$ satisfy $V[\sigma^{i+1}, r_{i+1}](w) = V[\sigma^{i}, r_i](w) \le 0$ due to the
invariant (a) for $i$. It follows that $V[\sigma^{i+1}, r_{i+1}](w) \le 0$ for almost all runs $w \in Run_{D(\sigma^{i+1})}(s)$.

Moreover, almost all $w \in Reach^{\sigma^{i+1}}_{C_{hit}}(s)$ satisfy $V[\sigma^{i+1}, r](w) \le 0$ because $r_i$ coincides with $r$ on $C_{hit}$
and almost all runs $w \in Reach^{\varrho_i}_{C_{hit}}(s)$ satisfy $V[\varrho_i, r_i](w) \le 0$. Also, almost all $w \in Reach^{\sigma^{i+1}}_{D}(s)$ where
$D \neq C_{hit}$ satisfy $V[\sigma^{i+1}, r](w) = V[\sigma^{i}, r](w) \le 0$ due to the invariant (a) for $i$. Hence, for almost all runs
$w \in Run_{D(\sigma^{i+1})}(s)$ we have $V[\sigma^{i+1}, r](w) \le 0$.

It follows that the invariant (a) is preserved. The invariant (b) is preserved due to the computation of $\tau$ on
line 7, $A'$ on line 8, and the update of $A$ on line 10. Finally, the strategy $\sigma$ defined on line 13 has the desired
properties because it coincides with $\sigma^{i+1}$ on all reachable states, and $\sigma^{i+1}$ satisfies the invariant (a). This
also implies that the vertex $s$ was put into $A'$ on line 8 and consequently into $A$ on line 10 in some iteration
of the while-loop. Thus $W \subseteq A$. Since by arguments similar to those above we can show that for every $s \in A$ we
have $\mathcal{P}(MP^{\sigma}(s)) = 1$, the correctness is proved.

Let us now consider the complexity. By [22, Theorem 1.10.2], for every $C \in BSCC[D(\varrho_i)]$ the
constant $a_{C,i} \in \mathbb{R}$ defined above is equal to $\sum_{u \in C} \mu(u) \cdot r_i(u)$, where $\mu$ is the invariant distribution for $C$ (note
that $C$ can be considered as a standalone irreducible Markov chain within $D(\varrho_i)$), which is the unique solution
of a system of linear equations, and thus computable in polynomial time. Hence, a suitable BSCC satisfying
the conditions from line 6 can be computed in polynomial time. In Section 3.3 we already showed that the
strategy $\varrho$ from line 5 can be found in polynomial time. In Section 3.1 we showed that finding the
strategy $\tau$ on line 7 can also be done in polynomial time. The other steps can clearly be performed in polynomial time.
Since the set $A$ grows by at least one vertex with every iteration of the while-loop, the loop itself is executed
at most $|V|$ times. Thus the procedure Qual-MP runs in polynomial time.
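
As an illustration of the linear-system step, the following sketch computes the invariant distribution of a small irreducible chain exactly and then the weighted sum of rewards. The function names, the row-stochastic list-of-lists encoding of the chain, and the use of exact rational arithmetic are assumptions made for this example, not part of the procedure Qual-MP.

```python
from fractions import Fraction


def stationary_distribution(P):
    """Solve mu * P = mu, sum(mu) = 1 for an irreducible chain (P: list of rows)."""
    n = len(P)
    # Augmented system: balance equations for states 0..n-2, then the
    # normalisation constraint in place of the redundant last equation.
    A = [[Fraction(P[i][j]) - (1 if i == j else 0) for i in range(n)] + [Fraction(0)]
         for j in range(n - 1)]
    A.append([Fraction(1)] * n + [Fraction(1)])
    # Gauss-Jordan elimination over the rationals.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def mean_payoff(P, reward):
    """Almost-sure long-run average reward inside an irreducible chain."""
    mu = stationary_distribution(P)
    return sum(m * Fraction(r) for m, r in zip(mu, reward))
```

For a two-state chain that stays in state 0 with probability 1/2 and always returns from state 1 to state 0, the invariant distribution is (2/3, 1/3).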
B Proofs of Section 4



For the rest of this section, we fix an OC-MDP $\mathcal{A} = (Q, \delta^{=0}, \delta^{>0}, (Q_N, Q_P), P^{=0}, P^{>0})$ and a non-empty set
$F \subseteq Q$ of final states. We assume (without restrictions) that for each $q \in F$, the configuration $q(0)$ has only
one outgoing transition $q(0) \mapsto q(0)$. We also use $N$ to denote $2^{|Q|}$.

Obviously, $OptValOne^{NT} \subseteq ValOne^{NT}$ and $OptValOne^{ST} \subseteq ValOne^{ST}$, but it is not immediately clear
whether the inclusions are proper. As we shall see, the sets $OptValOne^{NT}$, $ValOne^{NT}$, and $OptValOne^{ST}$
have a regular structure which can be captured by finite state automata, and optimal strategies are either
counter-oblivious or counter-regular.

Definition 26 (regular sets of configurations, counter-regular strategies). An $\mathcal{A}$-automaton is a pair
$(M, g)$ where $M = (C, \{a\}, \gamma, F)$ is a deterministic finite-state automaton and $g : Q \to C$ a mapping. A set of
configurations of $\mathcal{A}$ recognized by $(M, g)$ consists of all $p(i) \in Q \times \mathbb{N}_0$ such that $M$ accepts the word $a^i$ from
the initial state $g(p)$. A set of configurations is regular if it is recognized by some $\mathcal{A}$-automaton.

A MD strategy $\sigma$ is counter-regular if there is an $\mathcal{A}$-automaton $(M, g)$ and a function $f : Q \times C \to \delta^{>0}$,
where $C$ is the set of states of $M$, such that for all $p(i) \in Q \times \mathbb{N}$ we have that $\sigma(p(i)) = f(p, q)$, where $q \in C$
is the state entered from $g(p)$ after reading the word $a^i$.
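
To make the definition concrete, here is a minimal sketch of how a counter-regular strategy could be evaluated. The encoding is an assumption for this example: `gamma` is the transition function of the unary automaton as a dictionary from states to states, `g` maps control states to initial automaton states, and `f` maps pairs (control state, automaton state) to rule names; none of these names come from the paper.

```python
def run_unary(gamma, start, i):
    """State reached from `start` after reading the word a^i."""
    state = start
    for _ in range(i):
        state = gamma[state]
    return state


def recognizes(gamma, accepting, g, p, i):
    """Does the automaton accept a^i from g(p), i.e. is p(i) in the regular set?"""
    return run_unary(gamma, g[p], i) in accepting


def counter_regular_choice(gamma, g, f, p, i):
    """Rule chosen in configuration p(i) by the counter-regular strategy (gamma, g, f)."""
    return f[(p, run_unary(gamma, g[p], i))]
```

Because the automaton reads a one-letter alphabet, its run from any state is a lasso (a finite prefix followed by a cycle), so the chosen rule depends on the counter value only through an initial segment and then periodically.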

We start by proving the results about NT objectives. 

Theorem 12. The sets $ValOne^{NT}$ and $OptValOne^{NT}$ are equal. Moreover, given an OC-MDP $\mathcal{A}$ and a
configuration $q(i)$ of $\mathcal{A}$, we can decide in polynomial time whether $q(i) \in ValOne^{NT}$. Furthermore,
there is a CMD strategy $\sigma$ constructible in polynomial time which is optimal in every configuration of
$ValOne^{NT} = OptValOne^{NT}$.

Proof. We start by showing that for all $i \ge |Q|$ and all $p \in Q$ such that $p(i) \in ValOne^{NT}$ we have that

\[
1 = \sup_{\tau \in HR} \mathcal{P}(NT^{\tau}(p(i))) = \sup_{\tau \in HR} \mathcal{P}(CN^{\tau}(p(i))) \tag{1}
\]

Let us fix some $p(i) \in Q \times \mathbb{N}_0$ where $i \ge |Q|$. Consider an arbitrary HR strategy $\tau$ for $D^{\to}_{\mathcal{A}}$. For every
$0 \le j \le i$, we define the set $U^{\tau}_j \subseteq Q$ which consists of all $q \in Q$ such that with probability $> 0$ a run from
$p(i)$ under $\tau$ visits $q(j)$ before visiting any other configuration $s(k)$ with $k \le j$. Consider further an arbitrary
infinite sequence $\varepsilon_1, \varepsilon_2, \ldots$ of positive reals where $\lim_{n\to\infty} \varepsilon_n = 0$, and an infinite sequence of strategies
$\sigma_1, \sigma_2, \ldots$ such that $\mathcal{P}(NT^{\sigma_j}(p(i))) \ge 1 - \varepsilon_j$ for all $j$. Since there are only finitely many collections of $i + 1$
subsets of $Q$, there are subsequences $\varepsilon_{d_1}, \varepsilon_{d_2}, \ldots$ and $\sigma_{d_1}, \sigma_{d_2}, \ldots$, and a collection $U_0, \ldots, U_i \subseteq Q$ such that
$\lim_{n\to\infty} \varepsilon_{d_n} = 0$, $\mathcal{P}(NT^{\sigma_{d_j}}(p(i))) \ge 1 - \varepsilon_{d_j}$ for all $j$, and moreover $U_k = U^{\sigma_{d_j}}_k$ for all $j$ and all $0 \le k \le i$.

Since $i + 1 > |Q|$, there must be some $k$, where $0 \le k \le i$, such that $U_k \subseteq \bigcup_{i \ge j > k} U_j$. Thus, for every
$q \in U_k$ and $l \in \mathbb{Z}$ the strategies $\sigma_{d_j}$, $j \ge 1$, induce strategies for reaching $U_k \times \{l - 1, l - 2, \ldots\}$ from
$q(l)$ with a probability arbitrarily close to 1. This allows us to construct strategies for satisfying $CN$ with
probability arbitrarily close to 1 from every $q(l)$, $q \in U_k$, $l \in \mathbb{Z}$. Indeed, for an arbitrary $\delta > 0$ consider the
sequence $\{\delta_j\}_{j=1}^{\infty}$, where $\delta_j = \delta \cdot 2^{-j}$. For every $w \in (Q \times \mathbb{Z})^{+}$ which starts with some $q(l) \in U_k \times \mathbb{Z}$ we call
a min-step every index $j$ such that

- $w(j) = q(m)$ for some $q \in U_k$, $m \in \mathbb{Z}$,

- for all $h$ such that $0 \le h < j$ we have that if $w(h) = q'(m')$, then $q' \notin U_k$ or $m' > m$.
We define a strategy $\tau$ by setting $\tau(w) = \tau_j(w')$ where $j$ is the number of min-steps in $w$, $w' = w(m) \cdots w(|w| - 1)$
with $m$ being the last min-step, and $\tau_j$ is a $\delta_j$-optimal strategy for satisfying $CN$ from $w(m)$. It follows
that $\mathcal{P}(CN^{\tau}(q(l))) \ge \prod_{j=1}^{\infty} (1 - \delta_j) \ge 1 - \delta$. Since the strategies $\sigma_{d_j}$ also induce strategies for reaching $U_k \times \mathbb{Z}$
from $p(i)$ with probability arbitrarily close to 1, we proved (1).
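
For completeness, the product bound used above can be checked directly. The following short derivation assumes $0 < \delta \le 1$ (so that every factor lies in $[0,1]$) and uses the elementary inequality $\prod_j (1 - x_j) \ge 1 - \sum_j x_j$ for $x_j \in [0,1]$:

```latex
\[
\prod_{j=1}^{\infty} (1 - \delta_j)
\;\ge\; 1 - \sum_{j=1}^{\infty} \delta_j
\;=\; 1 - \delta \sum_{j=1}^{\infty} 2^{-j}
\;=\; 1 - \delta .
\]
```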

By applying Theorem 4, we can conclude that our theorem is true for all configurations of the form $p(i)$
with $p \in Q$, $i \ge |Q|$, since an optimal CMD strategy for $CN$ directly induces an optimal CMD strategy in
$D^{\to}_{\mathcal{A}}$ for $NT$. Let us denote this strategy by $\sigma$.

Consider now the case $p(i)$ when $i < |Q|$. Let

\[
A = ValOne^{NT} \cap \{q(j) \mid q \in Q,\ j \ge |Q|\}.
\]

Consider a finite MDP $D$ with vertices $Q \times \{0, 1, \ldots, |Q|\}$ such that for all $q \in Q$ the vertices $q(|Q|)$ are
stochastic with only one transition $q(|Q|) \mapsto q(|Q|)$ and the rest is just the restriction of transitions and
probabilities from $D^{\to}_{\mathcal{A}}$. Then the following statements are equivalent due to standard results for finite MDPs (see e.g. [7]):

- $p(i) \in ValOne^{NT}$.

- There are strategies in $D^{\to}_{\mathcal{A}}$ for reaching $A \cup (Q \times \{0\})$ from $p(i)$ with probability arbitrarily close to 1.

- There are strategies in $D$ for reaching $(A \cap (Q \times \{|Q|\})) \cup (Q \times \{0\})$ from $p(i)$ with probability arbitrarily
close to 1.

- There is a MD strategy $\tau$ in $D$ computable in polynomial time for reaching $(A \cap (Q \times \{|Q|\})) \cup (Q \times \{0\})$
from $p(i)$ with probability 1.

- $p(i) \in OptValOne^{NT}$ with the witnessing strategy being $\tau$ extended with $\sigma$ for configurations $q(m)$,
$m \ge |Q|$.
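
The polynomial-time step for finite MDPs can be illustrated by the textbook nested-fixpoint computation of the vertices from which a target set is reached with probability 1. This is a generic sketch, not the construction of [7]; the function name and the dictionary-based encoding (successor lists, with only the support recorded for stochastic vertices) are assumptions made for the example.

```python
def almost_sure_reach(succ, controlled, target):
    """Vertices from which `target` is reached with probability 1 under some MD strategy.

    succ: dict vertex -> list of successors (for stochastic vertices: the support).
    controlled: set of vertices where the controller picks the successor.
    Returns (winning_set, choice), where choice[v] is a successor for controlled v.
    """
    U = set(succ)
    while True:
        # Inner least fixpoint: vertices that can make progress towards
        # the target while never leaving the current candidate set U.
        R, choice = set(target), {}
        changed = True
        while changed:
            changed = False
            for v in U - R:
                if v in controlled:
                    good = [s for s in succ[v] if s in R]
                    ok = bool(good)
                    if ok:
                        choice[v] = good[0]
                else:
                    ok = (all(s in U for s in succ[v])
                          and any(s in R for s in succ[v]))
                if ok:
                    R.add(v)
                    changed = True
        if R == U:
            return U, choice
        U = R  # shrink the candidate set and repeat (outer greatest fixpoint)
```

Each outer round removes at least one vertex, so the number of rounds is at most the number of vertices, and only the supports of the distributions matter, not the exact probabilities.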

We have already defined a CMD strategy $\sigma$ such that $\mathcal{P}(NT^{\sigma}(p(i))) = 1$ for all $i \in \mathbb{N}$ and $p$ such that
$\{p\} \times \mathbb{N} \subseteq ValOne^{NT}$ (call these $p$ safe). To finish the proof of our theorem, it remains to redefine $\sigma$ for
configurations $p(i)$, $p \in Q$, $i < |Q|$ such that $p(i) \in ValOne^{NT}$ but $p(|Q|) \notin ValOne^{NT}$ (call these $p$ unsafe).
Note that due to (1) every $p \in Q$ is either safe or unsafe. For every unsafe $p$ there is some $i_p < |Q|$ such that
$p(i) \in ValOne^{NT}$ iff $i \le i_p$. Take the MD strategy $\tau$ such that $\mathcal{P}(NT^{\tau}(p(i_p))) = 1$. Note that a single such strategy can
be chosen for all such $p(i_p)$. We now redefine the CMD strategy $\sigma$ by redefining its selector $f$: $f(p)$ is
the rule generating the transition chosen by $\tau$ in $p(i_p)$. Since no configuration with an unsafe state is reached
from a configuration with a safe state under $\sigma$, this does not influence the property that $\mathcal{P}(NT^{\sigma}(p(i))) = 1$ for
all safe $p$. Moreover, from the definition of $f$ and the choice of $i_p$, almost all runs from $p(i)$, $i \le i_p$, under $\sigma$
either visit a configuration with a safe state, or a configuration from $Q \times \{0\}$, or $q(i_q + i - i_p)$ with $q$ unsafe.
Thus by double induction, first on $|Q| - i_p$ and then on $i$, for all unsafe $p$ and $i \le i_p$ we have $\mathcal{P}(NT^{\sigma}(p(i))) = 1$.
Since $ValOne^{NT} = \{(q, i) \mid q \text{ is safe},\ i \in \mathbb{N}\} \cup \{(q, i) \mid q \text{ is unsafe},\ i \le i_q\}$, we have proved the theorem. □

Remark 27. Let $I = \{p \in Q \mid p(i) \in ValOne^{NT} \text{ for all } i \in \mathbb{N}\}$. Then

- for every $q \in I \cap Q_P$ we have that if $(q, c, q') \in \delta^{>0}$, then $q' \in I$;

- if $q \in I \cap Q_N$, then there is $(q, c, q') \in \delta^{>0}$ such that $q' \in I$.

This means that we can define an OC-MDP $\mathcal{A}_I$ obtained from $\mathcal{A}$ by

- restricting the set of control states to $I$;

- restricting the set of positive rules to the rules of the form $(q, c, q')$ where $q, q' \in I$ and the probability
assignment is preserved;

- redefining the set of zero rules to $\{(q, 0, q) \mid q \in I\}$.

It follows from the proof of Theorem 12 that for every configuration $p(i)$ of $\mathcal{A}_I$ we have that $Val^{NT}(p(i)) =
Val^{CN}(p(i)) = 1$.

Now we give the promised example which demonstrates that the inclusion $OptValOne^{ST} \subseteq ValOne^{ST}$ is
proper. Consider the OC-MDP $\mathcal{A}$ of the following figure (we draw directly the associated MDP $D^{\to}_{\mathcal{A}}$):

[Figure: the MDP $D^{\to}_{\mathcal{A}}$, drawn as three rows of configurations for the control states $p$, $r$, $s$, with counter values 1 to 6 shown.]

The control state $p$ is non-deterministic, and the other two control states are stochastic. The probability
distributions are always uniform, and the only final control state is $s$. Now observe that $OptValOne^{ST} =
\{s(i) \mid i \in \mathbb{N}_0\}$, while $ValOne^{ST}$ consists of all $p(i)$, $s(i)$, $i \in \mathbb{N}_0$. To see this, let us fix an arbitrarily small
$\varepsilon > 0$, and choose some $c \in \mathbb{N}_0$ such that $1/2^{c} < \varepsilon$. We define a MD strategy $\sigma_{\varepsilon}$ by $\sigma_{\varepsilon}(p(k)) = p(k + 1)$ if
$k < c$, and $\sigma_{\varepsilon}(p(k)) = r(k)$ if $k \ge c$. Now it is easy to check that $\mathcal{P}(ST^{\sigma_{\varepsilon}}(v)) \ge 1 - 1/2^{c} > 1 - \varepsilon$ for every $v$
of the form $p(i)$ or $s(i)$. On the other hand, there is no strategy $\sigma$ such that $\mathcal{P}(ST^{\sigma}(p(i))) = 1$ for any $i \in \mathbb{N}_0$
because every strategy which makes the probability of reaching $s(0)$ from $p(i)$ positive inevitably makes the
probability of reaching $r(0)$ positive as well.

Note that the strategy $\sigma_{\varepsilon}$ from the above example is in fact both an MD and an FD strategy (see the definition
after Lemma 21), i.e. finitely representable by a deterministic finite automaton. This is always the case for
strategies approximating $Val^{ST}$ up to some fixed $\varepsilon > 0$. This is because if some strategy $\sigma$ satisfies
$\mathcal{P}(ST^{\sigma}(v)) \ge Val^{ST}(v) - \varepsilon/2$, then there is some $n \in \mathbb{N}$ such that the probability of runs from $ST^{\sigma}(v)$ not
longer than $n$ is at least $Val^{ST}(v) - \varepsilon$. On these runs only finitely many configurations appear and thus
the choices of $\sigma$ in these configurations can be kept in the finite memory of a finite automaton. Thus the
strategy $\sigma$ can be replaced by a FD strategy $\sigma'$ copying the choices of $\sigma$ until the $n$-th step. It follows that
$\mathcal{P}(ST^{\sigma'}(v)) \ge Val^{ST}(v) - \varepsilon$.

Now we present an exponential-time algorithm which computes an $\mathcal{A}$-automaton recognizing the set
$OptValOne^{ST}$, and we also show that there is a counter-regular strategy $\sigma$ constructible in exponential time
which is optimal in the configurations of $OptValOne^{ST}$. We also give a lower complexity bound and show
that deciding the membership to $OptValOne^{ST}$ is PSPACE-hard, and the membership to $ValOne^{ST}$ is hard
for the Boolean hierarchy over NP (note this hierarchy subsumes both NP and coNP). We did not manage
to provide analogous results for $ValOne^{ST}$, and we leave this problem as an open challenge for future work
(the above example gives a taste of issues that must be resolved to obtain a solution).

To prove Theorem 15, we need to formulate several auxiliary observations. For every $i \in \mathbb{N}_0$, let

- $Black_i = \{p(i) \in Q \times \mathbb{N}_0 \mid R_i(p) = b\}$

- $White_i = \{p(i) \in Q \times \mathbb{N}_0 \mid R_i(p) = w\}$

Further, let $White = \bigcup_{i \in \mathbb{N}_0} White_i$.

Lemma 28. There is a MD strategy $\sigma$ such that for all $0 \le j \le i$ and all $p(i) \in Black_i$ we have that
$\mathcal{P}(Reach^{\sigma}_{Black_j}(p(i))) = 1$ and $\mathcal{P}(Reach^{\sigma}_{White}(p(i))) = 0$.

Proof. It is known that for every finitely-branching MDP $D = (V, \mapsto, (V_N, V_P), Prob)$, every set $T \subseteq V$
of target vertices, and every initial vertex $v \in V$, if there is some (i.e., HR) strategy $\pi_v$ such that
$\mathcal{P}(Reach^{\pi_v}_T(v)) = 1$, then there is also a MD strategy $\sigma_v$ with this property (see, e.g., Theorem 7.2.11 of
[23], which applies to more general non-negative bounded total expected reward objectives). The individual
MD strategies $\sigma_v$ can be easily combined into a single MD strategy $\sigma$. Since $D^{\to}_{\mathcal{A}}$ is finitely-branching, we
can apply this generic result and conclude that there is a MD strategy $\sigma$ such that $\mathcal{P}(ST^{\sigma}(p(i))) = 1$ for
every $p(i) \in Q \times \mathbb{N}_0$ where $R(p(i)) = b$. This means that also $\mathcal{P}(Reach^{\sigma}_{Q \times \{j\}}(p(i))) = 1$ for every $j$ such
that $0 \le j \le i$. Now suppose that $\mathcal{P}(Reach^{\sigma}_{White}(p(i))) > 0$. Then there is some white configuration $q(j)$
such that $\mathcal{P}(Reach^{\sigma}_{\{q(j)\}}(p(i))) > 0$. Since $q(j)$ is white, we have that $\mathcal{P}(ST^{\sigma}(q(j))) < 1$. Thus, we obtain that
$\mathcal{P}(ST^{\sigma}(p(i))) < 1$, which is a contradiction. Since $\mathcal{P}(Reach^{\sigma}_{Q \times \{j\}}(p(i))) = 1$ and $\mathcal{P}(Reach^{\sigma}_{White}(p(i))) = 0$, we
have that $\mathcal{P}(Reach^{\sigma}_{Black_j}(p(i))) = 1$. □

Fig. 2. The structure of coloring $R$ (where $N = 2^{|Q|}$).

Lemma 14. There is $1 \le \ell \le N$ such that, for every $j \ge N$, the columns $R_j = R_{j+\ell}$.

Proof. We show that for all $j, k \in \mathbb{N}$ we have that if $R_j = R_k$, then also $R_{j+1} = R_{k+1}$. From this we easily
obtain our lemma: since there are at most $N$ different columns, there are $m, n \in \mathbb{N}$ such that $0 \le m < n \le N$
and $R_m = R_n$. We put $\ell = n - m$. Obviously, $R_j = R_{j+\ell}$ for every $j \ge m$. Since $m \le N$, we are done.

It suffices to prove that for every $i \in \mathbb{N}$, the column $R_{i+1}$ is completely determined by the column
$R_i$ in the following sense: For every $q \in Q$ we have that $R_{i+1}(q) = b$ iff there is a strategy $\sigma$ such that
$\mathcal{P}(Reach^{\sigma}_{Black_i}(q(i+1))) = 1$ and $\mathcal{P}(Reach^{\sigma}_{White_i}(q(i+1))) = 0$. Note that the existence of $\sigma$ does not
depend on the exact value of $i$ as long as the column $R_i$ stays the same. Hence, the above claim implies that
if $R_j = R_k$, then also $R_{j+1} = R_{k+1}$. It remains to prove this claim. The "$\Rightarrow$" direction follows directly
from Lemma 28. For the "$\Leftarrow$" direction, consider a strategy $\sigma$ such that $\mathcal{P}(Reach^{\sigma}_{Black_i}(q(i+1))) = 1$ and
$\mathcal{P}(Reach^{\sigma}_{White_i}(q(i+1))) = 0$. For each $p(i) \in Black_i$ there is a strategy $\sigma_p$ such that $\mathcal{P}(ST^{\sigma_p}(p(i))) = 1$.
Hence, we can construct a strategy $\pi$ which behaves like $\sigma$ until some $p(i) \in Black_i$ is reached, and from
that point on it behaves like $\sigma_p$. Obviously, $\mathcal{P}(ST^{\pi}(q(i+1))) = 1$ as needed. □
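
The pigeonhole argument in this proof is the usual one for sequences generated by a deterministic successor function over a finite domain. The sketch below shows it in isolation, assuming a hypothetical function `step` that maps a (hashable) column to the next column; how such a function would actually be computed for the coloring $R$ is not addressed here.

```python
def preperiod_and_period(first, step):
    """Return (m, l) with x_m = x_{m+l}, where x_0 = first and x_{i+1} = step(x_i)."""
    seen = {}          # value -> first index at which it occurred
    x, i = first, 0
    while x not in seen:
        seen[x] = i
        x, i = step(x), i + 1
    m = seen[x]        # first index of the repeated value
    return m, i - m    # preperiod m and period l = n - m
```

If the domain has at most $N$ elements, the loop stops after at most $N$ steps, which matches the bounds $m \le N$ and $\ell \le N$ used in the lemma.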

Now we show that the initial and periodic rectangles of the coloring $R$ (given in Figure 2) are computable
in exponential time. For this we need to formulate and prove an important observation which establishes a
powerful link to the results presented in Section 3. We start by defining an OC-MDP $\mathcal{A}_{R,\ell}$, which encodes
the structure obtained by deleting all white points from the periodic rectangle of $R$. Later, we construct such
an automaton also for another coloring $B$, where some points are gray. Therefore, the definition of $\mathcal{A}_{C,\ell}$ is
parametrized by a general coloring $C$ which satisfies certain conditions.

Definition 29 (the OC-MDP $\mathcal{A}_{C,\ell}$). Let $C : Q \times \mathbb{N}_0 \to \{b, w, g\}$ be a coloring such that $C_N = C_{N+\ell}$ and for
every $p(N+i) \in Q \times \mathbb{N}$ where $1 \le i \le \ell$ and $C(p(N+i)) \neq w$ we have that

(1) if $p(N+i)$ is probabilistic and $p(N+i) \mapsto q(N+j)$, then $C(q(N+k)) \neq w$, where $k = j \bmod \ell$;

(2) if $p(N+i)$ is non-deterministic, then there is some $p(N+i) \mapsto q(N+j)$ such that $C(q(N+k)) \neq w$, where
$k = j \bmod \ell$.

We define an OC-MDP $\mathcal{A}_{C,\ell}$ where

- the set $Q_{C,\ell}$ of control states of $\mathcal{A}_{C,\ell}$ consists of all $[p, i]$ where $p \in Q$, $1 \le i \le \ell$, and $C(p(N+i)) \neq w$. A
given control state $[p, i]$ is non-deterministic or probabilistic, depending on whether $p \in Q_N$ or $p \in Q_P$,
respectively;

- the set of zero rules consists of all triples $([p, i], 0, [p, i])$, where $[p, i] \in Q_{C,\ell}$;

- the set of positive rules is constructed as follows:

  • for all $(p, c, q) \in \delta^{>0}$ and all $i \in \mathbb{N}$ such that $1 \le i \le \ell$, $1 \le i+c \le \ell$, and $[p, i], [q, i+c] \in Q_{C,\ell}$, we add
  a rule $([p, i], 0, [q, i+c])$. If $[p, i]$ is probabilistic, then the probability of the rule $([p, i], 0, [q, i+c])$ is
  $P^{>0}(p, c, q)$.

  • for all $(p, c, q) \in \delta^{>0}$ and all $i \in \mathbb{N}$ such that $1 \le i \le \ell$, $i+c = \ell+1$, and $[p, i], [q, 1] \in Q_{C,\ell}$, we
  add a rule $([p, i], 1, [q, 1])$. If $[p, i]$ is probabilistic, then the probability of the rule $([p, i], 1, [q, 1])$ is
  $P^{>0}(p, c, q)$.

  • for all $(p, c, q) \in \delta^{>0}$ and all $i \in \mathbb{N}$ such that $1 \le i \le \ell$, $i+c = 0$, and $[p, i], [q, \ell] \in Q_{C,\ell}$, we add
  a rule $([p, i], -1, [q, \ell])$. If $[p, i]$ is probabilistic, then the probability of the rule $([p, i], -1, [q, \ell])$ is
  $P^{>0}(p, c, q)$.

Observe that conditions (1) and (2) guarantee that $\mathcal{A}_{C,\ell}$ is indeed an OC-MDP.
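
The three cases for positive rules amount to folding the counter modulo $\ell$ into the control state. A minimal sketch follows, under assumptions made for this example: rules of the original OC-MDP are triples `(p, c, q)` with `c` in {-1, 0, 1}, the non-white control states are given as a set of pairs `(p, i)`, and probabilities are omitted since they are copied unchanged.

```python
def folded_positive_rules(pos_rules, non_white, ell):
    """Positive rules of the folded OC-MDP.

    pos_rules: iterable of (p, c, q) with c in {-1, 0, 1}.
    non_white: set of pairs (p, i), 1 <= i <= ell, standing for control states [p, i].
    Returns triples ((p, i), d, (q, j)), where d is the new counter change.
    """
    rules = []
    for (p, c, q) in pos_rules:
        for i in range(1, ell + 1):
            if (p, i) not in non_white:
                continue
            k = i + c
            if 1 <= k <= ell:        # stays inside the current period
                target, d = (q, k), 0
            elif k == ell + 1:       # wraps forward: folded counter goes up
                target, d = (q, 1), 1
            else:                    # k == 0: wraps backward, counter goes down
                target, d = (q, ell), -1
            if target in non_white:
                rules.append(((p, i), d, target))
    return rules
```

One unit of the folded counter thus corresponds to $\ell$ units of the original counter.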
Lemma 30. For each configuration $[p, i](j)$ of $\mathcal{A}_{R,\ell}$ we have that $Val^{NT}([p, i](j)) = 1$.

Proof. Let $[p, i](j)$ be a configuration of $\mathcal{A}_{R,\ell}$. By definition of $\mathcal{A}_{R,\ell}$ we have that $R(p(N+i+j\ell)) = b$. By
Lemma 28, there is a MD strategy $\sigma$ such that $\mathcal{P}(Reach^{\sigma}_{Black_0}(r(m))) = 1$ and $\mathcal{P}(Reach^{\sigma}_{White}(r(m))) = 0$ for
every $r(m) \in Q \times \mathbb{N}_0$ where $R(r(m)) = b$. Consider a MD strategy $\pi$ in $D^{\to}_{\mathcal{A}_{R,\ell}}$ defined as follows: for every
configuration $[q, k](n)$ of $\mathcal{A}_{R,\ell}$ where $q \in Q_N$ we put $\pi([q, k](n)) = [q', k'](n')$, where

- $\sigma(q(N+k+n\ell)) = q'(t)$,

- $k' = (t-N) \bmod \ell$,

- $n' = (t-N) \operatorname{div} \ell$.

Note that the definition of $\pi$ is correct, because $R(q'(t)) = b$ and hence the transition $\pi([q,k](n)) =
[q',k'](n')$ exists in $D^{\to}_{\mathcal{A}_{R,\ell}}$ (realize that if $R(q'(t))$ was white, we would have a contradiction with
$\mathcal{P}(Reach^{\sigma}_{White}(q(N+k+n\ell))) = 0$). Since almost all runs of $D^{\to}_{\mathcal{A}}(\sigma)$ initiated in $p(N+i+j\ell)$ visit $Black_N$, we
obtain that almost all runs of $D^{\to}_{\mathcal{A}_{R,\ell}}(\pi)$ initiated in $[p,i](j)$ visit a configuration of the form $[q,i](0)$. This
means that $Val^{NT}([p,i](j)) = 1$. □

Lemma 31. Let $\mathcal{A} = (Q, \delta^{=0}, \delta^{>0}, (Q_N, Q_P), P^{=0}, P^{>0})$ be an OC-MDP. If $p(i) \mapsto^{*} q(0)$, then there is a path
from $p(i)$ to $q(0)$ in $D^{\to}_{\mathcal{A}}$ such that the counter stays bounded by $i + |Q|^2$ along this path.

Proof. For every $j \in \mathbb{N}_0$, we define a relation $\leadsto_j\ \subseteq Q \times Q$ inductively as follows:

- $\leadsto_0\ = \{(s, t) \in Q \times Q \mid s(1) \mapsto t(0)\}$

- $\leadsto_{j+1}$ consists of all $(s, t) \in Q \times Q$ such that one of the following conditions is satisfied:

  • $s \leadsto_j t$;

  • $s(1) \mapsto r(1)$ for some $r \in Q$ such that $r \leadsto_j t$;

  • $s(1) \mapsto r(2)$ for some $r \in Q$ such that $r \leadsto_j u$ and $u \leadsto_j t$ for some $u \in Q$.

A straightforward induction on $j$ reveals that if $s \leadsto_j t$, then there is a path from $s(1)$ to $t(0)$ in $D^{\to}_{\mathcal{A}}$ along
which the counter stays bounded by $j + 1$.

Let $\leadsto\ = \bigcup_{j \in \mathbb{N}_0} \leadsto_j$. Observe that $\leadsto\ =\ \leadsto_{|Q|^2}$. One can easily show that $s \leadsto t$ iff for every $i \in \mathbb{N}$ there
is a path from $s(i)$ to $t(i-1)$ such that the counter is less than or equal to $i + |Q|^2$ and greater than or equal to $i$ in all
configurations except for the last one (the "$\Rightarrow$" direction is proven for every $\leadsto_j$ by induction on $j$, and the
"$\Leftarrow$" direction is proven by induction on the length of a path from $s(i)$ to $t(i-1)$). From this we get that if
there is a path from $s(i)$ to $t(0)$ in $D^{\to}_{\mathcal{A}}$ such that the counter stays positive in all configurations except for
the last one, then there is a path from $s(i)$ to $t(0)$ along which the counter is bounded by $i + |Q|^2$. Finally,
we show that if there is a path from $s(i)$ to $t(0)$ along which the counter becomes zero $m$ times, then there is
a path from $s(i)$ to $t(0)$ along which the counter is bounded by $i + |Q|^2$ (this is the result we are aiming at).
However, this is easy to prove by induction on $m$. □
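
The inductive definition in this proof is a saturation procedure, and it can be sketched as a fixpoint loop. The encoding is an assumption for this example: positive rules are triples `(s, c, r)` with `c` in {-1, 0, 1}, so that `c == -1` corresponds to $s(1) \mapsto r(0)$, `c == 0` to $s(1) \mapsto r(1)$, and `c == 1` to $s(1) \mapsto r(2)$.

```python
def saturate(pos_rules):
    """Least relation containing (s, t) whenever s(1) can reach t(0) with the
    counter staying positive before the last step."""
    # Base case: a single decrementing rule.
    rel = {(s, t) for (s, c, t) in pos_rules if c == -1}
    while True:
        new = set(rel)
        for (s, c, r) in pos_rules:
            if c == 0:
                # s(1) -> r(1), then r goes down by one level to t.
                new |= {(s, t) for (r2, t) in rel if r2 == r}
            elif c == 1:
                # s(1) -> r(2), then two descents: r to u, and u to t.
                new |= {(s, t)
                        for (r2, u) in rel if r2 == r
                        for (u2, t) in rel if u2 == u}
        if new == rel:
            return rel
        rel = new
```

Since the relation is a subset of $Q \times Q$ and grows in every round but the last, the loop performs at most $|Q|^2 + 1$ rounds, which is the source of the $|Q|^2$ term in the bound.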

Lemma 32. There is a counter-regular strategy $\sigma$ which is optimal in every configuration of $OptValOne^{ST}$.
Further, the underlying $\mathcal{A}$-automaton and selector function of the strategy $\sigma$ are computable from the initial
and periodic rectangles of the coloring $R$ in time which is exponential in the size of $\mathcal{A}$.

Proof. We design a MD strategy $\pi$ such that

- $\pi$ is optimal in every configuration of $OptValOne^{ST}$.

- $\pi(p(i)) = \pi(p(i+\ell))$ for all $p \in Q_N$ and $i > |Q|^2 N^2 + N$.

- $\pi(p(i))$ is computable for all $p \in Q_N$ and $i \le |Q|^2 N^2 + N + \ell$ in time polynomial in $N$, assuming that the
initial and periodic rectangles of $R$ are known.

Obviously, the strategy $\pi$ can be easily transformed into a counter-regular strategy $\sigma$ with the required
properties.

First, for every $p(i)$ such that $i \le N$ and $R(p(i)) = b$ we fix a finite path $p(i) \mapsto \cdots \mapsto q(0)$ where $q \in F$
and all configurations in the path are black in $R$. Such a path must exist, and we can further safely assume
that the counter stays bounded by $|Q|^2 N^2 + N$ along this path (see Lemma 31) and no configuration appears
twice in the path. For all configurations $q(0)$ where $q \in F \cap Q_N$, the strategy $\pi$ is defined arbitrarily. Now, for
each path $w$ fixed above (in any order) we do the following: we identify all non-deterministic configurations
$q(j)$ in $w$ for which the strategy $\pi$ has not yet been defined, and let $\pi(q(j))$ select the (only) outgoing
transition of $q(j)$ that appears in the path $w$. Let $PathConf$ be the set of all configurations (non-deterministic
or probabilistic) that appear in some of the finite paths fixed above.

Now consider again the OC-MDP $\mathcal{A}_{R,\ell}$. According to Theorem 12 and Lemma 30, there is a CMD
strategy $\xi$ in $D^{\to}_{\mathcal{A}_{R,\ell}}$ such that for every configuration $[p,i](j)$ of $\mathcal{A}_{R,\ell}$ we have that $\mathcal{P}(NT^{\xi}([p,i](j))) = 1$.
For every control state $[p,i]$ of $\mathcal{A}_{R,\ell}$ where $p \in Q_N$, let $[p,i](1) \mapsto [q,j](k)$ be the transition selected by
$\xi([p, i](1))$. For every $p(N+i+y\ell)$ such that $y \in \mathbb{N}_0$ and $\pi(p(N+i+y\ell))$ has not yet been defined, we let
$\pi(p(N+i+y\ell))$ select the transition $p(N+i+y\ell) \mapsto q(N+j+y\ell+(k-1)\ell)$.

Obviously, we have that $\pi(p(i)) = \pi(p(i+\ell))$ for all $p \in Q_N$ and $i > |Q|^2 N^2 + N$. If the initial and periodic
rectangles of $R$ are known, the automaton $\mathcal{A}_{R,\ell}$ is effectively constructible by using Definition 29, and the
CMD strategy $\xi$ is computable in time polynomial in the size of $\mathcal{A}_{R,\ell}$ by Theorem 12. Hence, $\pi(p(i))$ is computable for all
$p \in Q_N$ and $i \le |Q|^2 N^2 + N + \ell$ in time polynomial in $N$. To see that $\pi$ is optimal in every configuration of
$OptValOne^{ST}$, realize the following:

- Let $White = \{q(j) \in Q \times \mathbb{N}_0 \mid R(q(j)) = w\}$. Then for every $p(i) \in OptValOne^{ST}$ we have that
$\mathcal{P}(Reach^{\pi}_{White}(p(i))) = 0$.

- Let $fin = \{q(0) \mid q \in F\}$. Then there is a fixed $\varepsilon > 0$ such that for every $p(i) \in PathConf$ we have that
$\mathcal{P}(Reach^{\pi}_{fin}(p(i))) \ge \varepsilon$. This is because for each of the finitely many $p(i) \in PathConf$ there is a finite path
from $p(i)$ to $fin$ in $D^{\to}_{\mathcal{A}}(\pi)$.
Input: An OC-MDP $\mathcal{A} = (Q, \delta^{=0}, \delta^{>0}, (Q_N, Q_P), P^{=0}, P^{>0})$, a non-empty set $F \subseteq Q$ of final states.
Output: The initial and periodic rectangles of the coloring $R$.

 1: for each $p(i)$ where $0 \le i \le 2N$ do $A(p(i)) := w$ done
 2: for each $\ell$ where $1 \le \ell \le N$ do
 3:   for each $C$ where $C : Q \to \{b, w\}$ do
 4:     for each $p(i)$ where $0 \le i \le N + \ell$ do $B(p(i)) := g$ done
 5:     for each $q \in F$ do $B(q(0)) := b$ done
 6:     $B_N := C$; $B_{N+\ell} := C$
 7:     repeat
 8:       for each $p(i)$ where $0 \le i \le N + \ell$ do $B(p(i)) :=$ check_color($p(i)$) done
 9:     until $B$ does not change
10:     if $B(p(i)) = r$ for some $p(i)$ then continue with the next $C$
11:     compute the OC-MDP $\mathcal{A}_{B,\ell}$
12:     for each $p(N+i)$ where $1 \le i \le \ell$ and $B(p(N+i)) \neq w$ do
13:       $B(p(N+i)) :=$ check_value($p(N+i)$)
14:     done
15:     if $B(p(i)) = r$ for some $p(i)$ then continue with the next $C$
16:     repeat
17:       for each $p(i)$ where $0 \le i \le N$ do $B(p(i)) :=$ check_path($p(i)$) done
18:     until $B$ does not change
19:     for each $p(i)$ where $0 \le i \le N$ and $B(p(i)) = g$ do $B(p(i)) := b$ done
20:     if $B(p(i)) = r$ for some $p(i)$
21:       then continue with the next $C$
22:       else transfer all black points of $B$ to $A$
23:   done
24: done
25: find the least $\ell$ such that $A_N = A_{N+\ell}$
26: output $A_0, \ldots, A_N$ and $A_{N+1}, \ldots, A_{N+\ell}$

Fig. 3. An exponential-time algorithm which computes the coloring $R$

- For each $p(i) \in OptValOne^{ST} \setminus PathConf$ we have that $\mathcal{P}(Reach^{\pi}_{PathConf}(p(i))) = 1$. This is because
almost all runs in $D^{\to}_{\mathcal{A}}(\pi)$ initiated in $p(i)$ tend to decrease the counter until they reach a configuration of
$PathConf$.

From these three properties, one can conclude that $\mathcal{P}(ST^{\pi}(p(i))) = 1$ for every $p(i) \in OptValOne^{ST}$. □

Theorem 15. An $\mathcal{A}$-automaton recognizing the set $OptValOne^{ST}$ is computable in exponential time.
Further, there is a counter-regular strategy $\sigma$ constructible in exponential time which is optimal in every
configuration of $OptValOne^{ST}$.

Proof. To construct an $\mathcal{A}$-automaton recognizing the set $OptValOne^{ST}$, it suffices to compute the initial and
periodic rectangles of $R$. This is achieved by the algorithm given in Fig. 3.

Since the width of the initial rectangle is $N + 1$ and the width of the periodic rectangle is at most $N$, it
suffices to compute the first $2N + 1$ columns of $R$. For this purpose, we introduce two auxiliary colorings $A$
and $B$ whose domain is restricted to $Q \times \{0, \ldots, 2N\}$. The coloring $A$ is just a memory used to accumulate
the information about all of the newly discovered black points. The color of all points in $A$ is initially white
(line 1) and, as we shall see, each $p(i)$ such that $0 \le i \le 2N$ and $R(p(i)) = b$ is eventually recolored to black
in $A$ at line 22.

The coloring $B$ is used to discover more and more points that are black in $R$. This is achieved by trying
out all candidates $\ell$ for the width of the periodic rectangle (line 2) and all candidates $C$ for the column $R_{N+\ell}$
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(line 3). For each choice of i and C, the color of all p{i) in B, where < / < N+l, is first initialized to gray 
at line 4 (the intuitive meaning of gray is "don't know"). Then, all ^^(O) where ^ e F are recolored to black 
at line 5, which is surely correct. Further, the columns Sjv+^ and B]^ are set to the current candidate C (note 
Rn+{ - Rfi)- Now, we try to recolor as much points as we can using the function check_color (lines 7-9). 
For a given p{i), where < / < N+£, the function check_color first computes the set of col(p(i)) of colors 
that p{i) should have according to its i-> successors and predecessors (we say that q{j) is a successor of 
r{k) if r{k) qij))- Formally, col{p{i)) is the least set of colors satisfying the following: 

- if $p \in Q_P$ and all $\mapsto$ successors of $p(i)$ are black in $B$, then $b \in col(p(i))$;

- if $p \in Q_P$ and some $\mapsto$ successor of $p(i)$ is white in $B$, then $w \in col(p(i))$;

- if $p \in Q_N$ and all $\mapsto$ successors of $p(i)$ are white in $B$, then $w \in col(p(i))$;

- if $p \in Q_N$ and some $\mapsto$ successor of $p(i)$ is black in $B$, then $b \in col(p(i))$;

- if $q(j) \mapsto p(i)$ where $q \in Q_P$ and $q(j)$ is black in $B$, then $b \in col(p(i))$;

- if $q(j) \mapsto p(i)$ where $q \in Q_N$ and $q(j)$ is white in $B$, then $w \in col(p(i))$.

Note that in the case when $i = N+\ell$, we need to know the $B$ color of $\mapsto$ successors and predecessors of
$p(i)$ whose counter value can also be $N+\ell+1$. Here we stipulate that $B(q(N+\ell+1)) = B(q(N+1))$ (note
that $R(q(N+\ell+1)) = R(q(N+1))$). Intuitively, $col(p(i))$ contains the colors of $p(i)$ that are "enforced"
by the colors of its $\mapsto$ successors and predecessors. If both black and white are enforced, or if $B(p(i))$ is
inconsistent with the enforced color, we have discovered an inconsistency in the current choice of $\ell$ and $C$.
Hence, the color which is returned by check_color($p(i)$) is determined as follows:

- if $col(p(i)) = \emptyset$, then check_color($p(i)$) returns $B(p(i))$ (i.e., the current color of $p(i)$ in $B$);

- if $col(p(i)) = \{c\}$ and $B(p(i)) = g$, then check_color($p(i)$) returns $c$;

- if $col(p(i)) = \{c\}$ and $B(p(i)) = c$, then check_color($p(i)$) returns $c$;

- in the other cases, check_color($p(i)$) returns $r$.

Note that the red color is used to mark a consistency error. Also note that each $p(i)$ is recolored at most twice,
and so the repeat-until loop in lines 7-9 terminates after $O(N)$ iterations, where each iteration invokes the
function check_color only $O(N)$ times.
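
The rules for $col(p(i))$ and for the value returned by check_color can be sketched in Python as follows. This is
only an illustration under stated assumptions: the function names `enforced_colors` and `check_color`, the
single-letter color constants, and the way successors and predecessors are passed in are hypothetical choices made
here, not part of the algorithm in Fig. 3.

```python
from typing import Iterable, Set, Tuple

# Hypothetical color constants: black, white, gray ("don't know"), red (inconsistency).
BLACK, WHITE, GRAY, RED = "b", "w", "g", "r"


def enforced_colors(
    is_probabilistic: bool,
    successor_colors: Iterable[str],
    predecessors: Iterable[Tuple[bool, str]],
) -> Set[str]:
    """Compute col(p(i)) from the B-colors of the successors and predecessors of p(i).

    `predecessors` holds pairs (predecessor_is_probabilistic, predecessor_color).
    Assumes p(i) has at least one successor.
    """
    succ = list(successor_colors)
    col: Set[str] = set()
    if is_probabilistic:
        if all(c == BLACK for c in succ):
            col.add(BLACK)
        if any(c == WHITE for c in succ):
            col.add(WHITE)
    else:
        if all(c == WHITE for c in succ):
            col.add(WHITE)
        if any(c == BLACK for c in succ):
            col.add(BLACK)
    for pred_is_probabilistic, pred_color in predecessors:
        if pred_is_probabilistic and pred_color == BLACK:
            col.add(BLACK)
        if not pred_is_probabilistic and pred_color == WHITE:
            col.add(WHITE)
    return col


def check_color(current: str, col: Set[str]) -> str:
    """Combine the current B-color of p(i) with the enforced set col(p(i))."""
    if not col:
        return current  # nothing is enforced: keep the current color
    if len(col) == 1:
        (c,) = col
        if current in (GRAY, c):
            return c  # gray is resolved to c, an agreeing color is kept
    return RED  # both colors enforced, or the enforced color contradicts B
```

A point that is already black or white can only stay as it is or turn red, which matches the observation that
each point is recolored at most twice.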

After terminating the loop in lines 7-9, the algorithm checks if there is a red $p(i)$ and if it is the case,
it rejects the current $C$ and continues with the next candidate (line 10). Otherwise, all points in $B$ are either
black, white, or gray, where

(1) for all $p(i)$ such that $B(p(i)) = g$ we have that check_color($p(i)$) returns $g$;

(2) for all $p(i)$ such that $B(p(i)) \neq g$ we have that if the width of the periodic rectangle of $R$ is $\ell$ and
$R_{N+\ell} = C$ (i.e., the current candidates $\ell$ and $C$ are the "real" ones), then $B(p(i)) = R(p(i))$. It is easy
to show that this claim is an invariant of the repeat-until loop in lines 7-9.

Now we need to resolve the color of the remaining gray points. First, we concentrate on the gray points
in the columns $B_{N+1}, \ldots, B_{N+\ell}$ and check whether they can constitute the periodic rectangle of $R$ after
some further recoloring. This is done by checking the condition of Lemma 30. First we construct the
OC-MDP of Definition 29 (line 11). Note that the condition (2) above guarantees that the coloring $B$
satisfies the requirements of Definition 29. For each $p(N+i)$ where $1 \le i \le \ell$ and $B(p(N+i)) \neq w$
we recolor $p(N+i)$ to check_value($B(p(N+i))$) at lines 12-14. Here the function check_value does the
following: if $B(p(N+i)) = g$, then check_value($B(p(N+i))$) returns either $b$ or $w$, depending on whether
the corresponding value in the OC-MDP of Definition 29 is equal to 1 for all $j \in \mathbb{N}_0$ or not, respectively.
If $B(p(N+i)) \neq g$, then check_value($B(p(N+i))$) returns either $b$ or $r$, depending on whether the same
condition holds or not, respectively. Note that check_value($B(p(N+i))$) is computable in time polynomial in the
size of this OC-MDP by Theorem 12. Then we check whether some point has been recolored to red, and if it is the
case, we continue with the next candidate (line 15). Otherwise, all points in the columns
$B_{N+1}, \ldots, B_{N+\ell}$ are now black or white. It is important to note that the functions check_color and
check_value would not report any inconsistencies in the current $B$ (i.e., if we run the code at lines 7-14 again
after line 15, no point would be recolored to red). This follows directly from Remark 27.

It remains to resolve the gray points in the columns $B_0, \ldots, B_N$. Here we use the observation about $R$
formulated in Lemma 32. Let $\bar{B}$ be the (only) coloring satisfying the following conditions:

- $\bar{B}_j = B_j$ for every $0 \le j \le N+\ell$;

- $\bar{B}_{N+\ell+i} = \bar{B}_{N+i}$ for every $i \in \mathbb{N}$.

For every $p(i)$ where $0 \le i \le N$ and $B(p(i)) \neq w$, we recolor $p(i)$ to check_path($p(i)$). The function
check_path($p(i)$) checks, depending on whether $p$ is probabilistic/non-deterministic, whether for all/some
$p(i) \mapsto r(j)$ there is a finite path $r(j) \mapsto \cdots \mapsto q(0)$ such that $q \in F$ and all configurations
in this path are black or gray in the current $\bar{B}$. If this is the case, check_path($p(i)$) returns the current
$B(p(i))$. Otherwise, check_path($p(i)$) returns either white or red, depending on whether $B(p(i)) = g$ or
$B(p(i)) = b$, respectively. After finishing the loop at lines 16-18, all of the remaining gray points of
$B_0, \ldots, B_N$ are recolored to black at line 19. Note that the function check_path can be implemented in
polynomial time by employing, e.g., standard polynomial-time algorithms for the reachability problem in pushdown
automata. Then we check whether some point has been recolored to red, and if it is the case, we continue with the
next candidate (line 21). Otherwise, all points of $B_0, \ldots, B_{N+\ell}$ are black or white. Observe that

- for every $p(i)$ such that $i \le N$ and $B(p(i)) = b$ there is a finite path $p(i) \mapsto \cdots \mapsto q(0)$ where
$q \in F$ and all configurations in the path are black in $\bar{B}$. Further, if $p \in Q_P$ and $p(i) \mapsto r(j)$,
then $B(r(j)) = b$.

- there is a CMD strategy ^ in such that for every configuration [p, of we have that 

nNT^{{p,m)) ^ 

These are exactly the ingredients which were needed to construct the strategy $\pi$ in the proof of Lemma 32.
If we apply the same construction to the (periodically extended) coloring $\bar{B}$, we obtain a strategy $\pi_B$
such that $\mathcal{P}(ST^{\pi_B}(p(i))) = 1$ for every $p(i) \in Q \times \mathbb{N}_0$ where $\bar{B}(p(i)) = b$. This means
that all black points in the columns $B_0, \ldots, B_{N+\ell}$ can be safely transferred from $B$ to $A$, which is
done at line 22.

After terminating the loop at lines 2-24, the algorithm finds the least $\ell$ such that $A_N = A_{N+\ell}$, and
outputs the rectangles $A_0, \ldots, A_N$ and $A_{N+1}, \ldots, A_{N+\ell}$. Since the "real" values of $\ell$ and $C$
are eventually tested as candidates and the algorithm recolors a gray point to a white point only if some
condition satisfied by $R$ is violated, all black points of $R_0, \ldots, R_{N+\ell}$ are eventually discovered.
Since the functions check_color,
check_value, and check_path need only polynomial time in the size of A^, the whole algorithm is polyno- 
mial in the size of A^. 

After computing the initial and periodic rectangles of $R$, a counter-regular strategy $\sigma$ which is optimal
for all configurations of $\mathit{OptValOne}^{ST}$ can be constructed by using Lemma 32.

Theorem 16. Membership in $\mathit{ValOne}^{ST}$ is BH-hard. Membership in $\mathit{OptValOne}^{ST}$ is PSPACE-hard.

Proof. We start with proving the BH-hardness. Our proof is essentially a variation on a proof by Serre
[24] (using a technique that originated in [18] and was later reshaped in [16]) showing that the reachability
problem for non-probabilistic 2-player 1-counter games is DP-hard. We show that similar arguments work
to show BH-hardness for OC-MDPs.

First, we show that membership in $\mathit{ValOne}^{ST}$ is NP-hard and coNP-hard, and then we show how to
combine these to get BH-hardness.

We start with NP-hardness. We reduce from SAT. Suppose we are given a CNF formula
$\psi = C_1 \wedge \cdots \wedge C_m$ over variables $\{x_1, \ldots, x_r\}$. We will encode assignments to the variables
of $\psi$ by integers, as follows. Let $\pi_1, \ldots, \pi_r$ denote the first $r$ prime numbers. Then an integer $n$
corresponds to an assignment that assigns true to $x_i$ if and only if $\pi_i$ divides $n$. Note that multiple
integers map to the same assignment, but that all assignments are certainly mapped to by some positive integer
(e.g., 1 assigns false to every variable). It follows from the strong forms of Bertrand's postulate (see, e.g.,
Theorem 5.8 in [25]) that (as a very conservative bound) $\pi_r$ is polynomially bounded in $r$ for all
sufficiently large $r$. (We can thus of course trivially compute the first $r$ primes $\pi_1, \ldots, \pi_r$ in
time polynomial in $r$.)
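
The encoding of assignments by integers can be made concrete with a small Python sketch. The function names and
the DIMACS-style clause representation (literal $k$ for $x_k$, literal $-k$ for $\neg x_k$) are assumptions made
here for illustration only; they do not appear in the reduction itself.

```python
from typing import List


def first_primes(r: int) -> List[int]:
    """Return the first r primes by trial division against the primes found so far."""
    primes: List[int] = []
    candidate = 2
    while len(primes) < r:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def decode_assignment(n: int, r: int) -> List[bool]:
    """x_i is true iff the i-th prime divides n (entry i-1 of the returned list)."""
    return [n % p == 0 for p in first_primes(r)]


def satisfies(n: int, clauses: List[List[int]], r: int) -> bool:
    """Check whether the assignment encoded by n satisfies every clause."""
    assignment = decode_assignment(n, r)
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )
```

For instance, with $r = 3$ the integers 6 and 12 both encode the assignment $x_1 = x_2 = \text{true}$,
$x_3 = \text{false}$, which illustrates that several integers map to the same assignment.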

The OC-MDP will have a start state $s_0$, which is controlled by the (maximizing) player. The initial
configuration is $s_0(1)$ and the player can choose to increment the counter and stay in state $s_0$, or to move to
state $s_1$ without changing the counter. Thus, after it has repeatedly incremented the counter up to a "guessed"
number $n > 0$ which represents an assignment, the game moves to configuration $s_1(n)$.

State $s_1$ is probabilistic, and it chooses, uniformly at random, one of the clauses $C_i$, which it claims
is not satisfied by the assignment associated with $n$, and moves to configuration $s'_i(n)$. The state $s'_i$ is
controlled by the maximizing player, which chooses a literal $l_j$ in $C_i$ and moves to $s'_{i,j}(n)$. Suppose
$l_j = x_j$. From this configuration we deterministically decrement the counter, but keep track, using $\pi_j$
auxiliary states, of how many times, mod $\pi_j$, we have decremented the counter. Clearly, if we hit the counter
value 0 in a state that indicates we have decremented a number of times which is $0 \pmod{\pi_j}$, then the
assignment corresponding to $n$ satisfies clause $C_i$. Similarly, if $l_j = \neg x_j$, we can check that the number
of times decremented is $\not\equiv 0 \pmod{\pi_j}$, in which case again $n$ satisfies clause $C_i$. Since the random
player chose all clauses with equal probability, there is a strategy to terminate in such "accepting" states with
probability 1 if there is a satisfying assignment to $\psi$. Also note that if there is no satisfying assignment
to $\psi$, then there is a fixed $\delta > 0$ such that for every strategy the probability of non-terminating or
terminating in a "non-accepting" control state is at least $\delta$. Note that, as it is easy to check using the
above bound on $\pi_r$, the size of the resulting OC-MDP is polynomial in the size of the formula $\psi$.

Next, for coNP-hardness, suppose we have a CNF formula $\psi = C_1 \wedge \cdots \wedge C_m$, over variables
$\{x_1, \ldots, x_r\}$, and we want to decide unsatisfiability. We do as before, but with some role reversals
between non-deterministic and probabilistic control states. Starting in configuration $s_0(1)$ where $s_0$ is now
probabilistic, we randomly either increment the counter or change the state to $s_1$ (with, say, equal
probability). Thus we eventually move to state $s_1$ with probability 1, and for every positive integer $n$, with
some positive probability we move to $s_1(n)$. The state $s_1$ is controlled (i.e., non-deterministic).

The player's strategy chooses (guesses) a clause $C_i$ which it thinks cannot be satisfied by the assignment
$n$, and moves to configuration $s'_i(n)$, where $s'_i$ is probabilistic. Then the random player picks one of the
literals $l_j$ of clause $C_i$, uniformly at random (intuitively claiming at least one of them will be satisfied and
thus with positive probability we will terminate in a rejecting state), and moves to $s'_{i,j}(n)$. We then
decrement deterministically as before, except that now when we terminate we accept precisely in those states where
we would not have accepted before. Specifically, we accept if "assignment" $n$ did not assign true to literal $l_j$
of clause $C_i$, which again we can check by keeping track of how many times we decremented mod $\pi_j$, upon
hitting counter value 0.

Note that under every strategy the probability of termination is 1. Similarly as before, there is a strategy
such that the probability of termination in an accepting state is 1 if there is no satisfying assignment to
$\psi$; on the other hand, there is some $\delta > 0$ such that terminating in a "non-accepting" state occurs with
probability at least $\delta$ under every strategy if there is a satisfying assignment to $\psi$.



Finally, to show BH-hardness, consider any statement which is a $\wedge$-$\vee$ combination of statements of the
form "$\psi_i$ is satisfiable" and "$\psi_j$ is not satisfiable", where the $\psi_i$'s are Boolean formulas. Deciding
whether such statements are true is BH-complete. In order to mimic this with an OC-MDP, we do as follows: $\vee$ is
mimicked by the controller (i.e., a non-deterministic state) picking one of the disjuncts. $\wedge$ is mimicked by
the random player (a probabilistic state) picking one of the conjuncts uniformly at random. When we hit a statement
"$\psi_i$ is (un)satisfiable", we play the corresponding game. It is easy to check that the maximizer has a strategy
to terminate in an accepting state with probability 1 if the entire statement is true, and that there is a
$\delta > 0$ such that for every strategy termination in an accepting state has probability at most $1 - \delta$ if
the entire statement is false.

Note that in all the OC-MDPs from the reductions above the sets $\mathit{OptValOne}^{ST}$ and $\mathit{ValOne}^{ST}$ are
equal. Thus we have also already proved BH-hardness of membership in both of them. We will now prove,
however, that membership in $\mathit{OptValOne}^{ST}$ is even PSPACE-hard.

The proof is by reduction from the emptiness problem for simple alternating finite automata over a one-letter
alphabet. A simple alternating finite automaton over a one-letter alphabet (call it AFA for short in the
rest of the text) is a tuple $(Q, \delta, q_0, F)$ where $Q$ is a finite nonempty set of states, $q_0 \in Q$,
$F \subseteq Q$ and $\delta$ is a transition function assigning to every state either another state, or the
"existential" pair $p \vee q$ of states $p, q \in Q$, or the "universal" pair $p \wedge q$. The automaton is used to
recognise sets of words over a one-letter alphabet. Such words can be considered as numbers from $\mathbb{N}_0$. The
language of the automaton is defined to be the set of exactly those $n \in \mathbb{N}_0$ which are accepted from the
state $q_0$, written $Acc(q_0, n)$. The semantics of the expression $Acc(q, n)$, meaning accepting a number $n$ from
a state $q$, is defined inductively on $n$: $Acc(q, 0)$ is true iff $q \in F$. For $n = k + 1$ we have three cases:

- If $\delta(q) = p$ then $Acc(q, k+1)$ is equivalent to $Acc(p, k)$.

- If $\delta(q) = p_1 \vee p_2$ then $Acc(q, k+1)$ is true iff at least one of $Acc(p_1, k)$ and $Acc(p_2, k)$ is true.

- If $\delta(q) = p_1 \wedge p_2$ then $Acc(q, k+1)$ is true iff both $Acc(p_1, k)$ and $Acc(p_2, k)$ are true.

See [17] for more details about AFA. Proposition 4 from [17] states that the problem of deciding whether
the language of a given AFA is empty is PSPACE-hard.
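
The inductive definition of $Acc$ translates directly into a bottom-up computation. The following Python sketch
is only an illustration: the dictionary encoding of $\delta$ (tuples tagged "next", "or", "and") and the function
names are hypothetical choices made here.

```python
from typing import Dict, Set, Tuple

# Hypothetical encoding of delta: ("next", p), ("or", p1, p2) or ("and", p1, p2).
Rule = Tuple[str, ...]


def accepting_states(delta: Dict[str, Rule], final: Set[str], n: int) -> Set[str]:
    """Return {q : Acc(q, n)} by unfolding the inductive definition n times."""
    current = set(final)  # Acc(q, 0) holds iff q is in F
    for _ in range(n):
        nxt = set()
        for q, rule in delta.items():
            kind = rule[0]
            if kind == "next":
                ok = rule[1] in current
            elif kind == "or":
                ok = rule[1] in current or rule[2] in current
            else:  # "and"
                ok = rule[1] in current and rule[2] in current
            if ok:
                nxt.add(q)
        current = nxt
    return current


def accepts(delta: Dict[str, Rule], q0: str, final: Set[str], n: int) -> bool:
    """Acc(q0, n): is the number n accepted from state q0?"""
    return q0 in accepting_states(delta, final, n)
```

Evaluating $Acc(q_0, n)$ for a single $n$ is easy; the hardness of emptiness comes from having to consider all
$n \in \mathbb{N}_0$.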

We now describe a log-space reduction of the emptiness problem for AFA to the membership in
$\mathit{OptValOne}^{ST}$ for OC-MDPs. Let $(Q, \delta, q_0, F)$ be an AFA. The reduction returns the following OC-MDP:
$(Q \cup \{p\}, \delta^{=0}, \delta^{>0}, (Q_N, Q_P), P^{=0}, P^{>0})$ along with the set $F$ of final states and the
initial configuration $p(1)$, where

- $p$ is a fresh new state, $p \notin Q$;

- $\delta^{=0} = \{(p, 0, p)\} \cup \{(q, 0, q) \mid q \in Q\}$;

- $\delta^{>0} = \{(p, +1, p), (p, -1, q_0)\} \cup \{(q, -1, r) \mid q, r \in Q, \text{ whenever } r \text{ occurs in } \delta(q)\}$;

- $Q_N = \{p\} \cup \{q \in Q \mid \exists r, s \in Q : \delta(q) = r \vee s\}$, $Q_P = Q \setminus Q_N$;

- the probability assignments always return the uniform distribution.
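
The construction above can be sketched as a short Python function. This is an illustration under assumptions: the
AFA is encoded as a dictionary mapping each state to a tuple tagged "next", "or" or "and", rules are triples
(source, counter change, target), and the function name and the returned dictionary layout are hypothetical.

```python
from typing import Dict, Set, Tuple

Rule = Tuple[str, ...]  # ("next", p), ("or", p1, p2) or ("and", p1, p2)


def afa_to_ocmdp(delta: Dict[str, Rule], q0: str, final: Set[str], fresh: str = "p"):
    """Build the rule sets and the state partition of the OC-MDP for a given AFA."""
    states = set(delta)
    assert fresh not in states, "the new state must not occur in Q"
    # Zero rules: every state loops without changing the counter.
    zero_rules = {(fresh, 0, fresh)} | {(q, 0, q) for q in states}
    # Positive rules: the fresh state guesses n, then each AFA step decrements.
    positive_rules = {(fresh, +1, fresh), (fresh, -1, q0)}
    for q, rule in delta.items():
        for r in rule[1:]:
            positive_rules.add((q, -1, r))
    # Controlled states: the fresh state and all existential states.
    controlled = {fresh} | {q for q, rule in delta.items() if rule[0] == "or"}
    probabilistic = states - controlled
    return {
        "zero_rules": zero_rules,
        "positive_rules": positive_rules,
        "controlled": controlled,
        "probabilistic": probabilistic,
        "final": set(final),
        "initial": (fresh, 1),
    }
```

Universal states become probabilistic because almost-sure reachability of $F \times \{0\}$ forces every branch
chosen with positive probability to succeed.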

If $n$ is accepted by the AFA then the following MD strategy $\sigma$ proves $p(1) \in \mathit{OptValOne}^{ST}$:

- $\sigma(p(n+1)) = q_0(n)$ and $\sigma(p(k)) = p(k+1)$ for $k \neq n+1$,

- $\sigma(q(k)) = r(k-1)$ for every $q \in Q_N$ and $k \in \mathbb{N}$, where $r$ is an arbitrary state occurring in
$\delta(q)$ with $Acc(r, k-1)$ being true, and

- $\sigma(q(k))$ is defined arbitrarily if there is no such $r$.

On the other hand, if $\sigma$ ensures almost sure reaching $F \times \{0\}$ from $p(1)$, there must be some $n$ such that
$q_0(n)$ is visited on some path from $p(1)$ to $F \times \{0\}$ with positive probability. It can easily be shown that
every configuration of the form $q(k)$ visited after $q_0(n)$ satisfies $Acc(q, k)$. In particular $Acc(q_0, n)$ holds
and thus the language of the AFA is not empty. □

We now show that qualitative problems for the special subclass of OC-MDPs given by solvency games
[1] can be solved in polynomial time. We first recall more formally the definition of solvency games
from [1], which was described informally in the introduction. A solvency game is given by a positive
integer $n$ (the initial pot of money belonging to the gambler), and a finite set $\mathcal{A} = \{A_1, \ldots, A_k\}$ of
actions (or "gambles"), each of which is associated with a finite-support probability distribution on the
integers. Since for computational purposes we have to be given these distributions as finite input, we assume
that the distribution associated with each action $A_i$, $i = 1, \ldots, k$, is encoded by giving a set of pairs
$\{(n_{i,1}, p_{i,1}), (n_{i,2}, p_{i,2}), \ldots, (n_{i,m_i}, p_{i,m_i})\}$, such that for $j = 1, \ldots, m_i$,
$n_{i,j} \in \mathbb{Z}$ and $p_{i,j}$ are positive rational probabilities, i.e., $p_{i,j} \in (0, 1]$ and
$\sum_{j=1}^{m_i} p_{i,j} = 1$. We assume the integers $n_{i,j}$ and the rational values $p_{i,j}$ are
both encoded in the standard way, in binary notation.

In a solvency game the player (or gambler or investor) starts with the initial pot of money, $n$, and has
to repeatedly choose an action (gamble) from the set $\mathcal{A}$. If at any time the current pot of money is $n'$,
and the gambler then chooses action $A_i$, then we sample from the finite-support distribution associated with
$A_i$, and the integer, $d$, resulting from this random sample is added to $n'$, obtaining the new pot of money
$n' + d$. If the pot of money hits or goes below zero, then the gambler loses (goes bankrupt) and the game ends.
Otherwise, we repeat the gambling process with the new pot of money $n' + d$. The gambler's aim is to
minimize the probability of ever losing the game, i.e., to minimize the probability of ever going bankrupt.
(Note that we do not allow the gambler to simply choose to stop gambling, which would be too easy a way to
prevent going bankrupt. Our gamblers are hopelessly addicted! Perhaps then "investor" is more appropriate.)

It should be clear that solvency games constitute a special subclass of OC-MDPs. Namely, the counter
in an OC-MDP can be used to keep track of the gambler's wealth. Although, by definition, OC-MDPs
can only increment or decrement the counter by one in each state transition, it is easy to simulate any
finite change to the counter value by using additional states and incrementing or decrementing the counter
by one at a time. Namely, the OC-MDP will have a "base" control state, $s$, from which it chooses
from the set of actions $\{A_1, \ldots, A_k\}$. If action $A_i$ is associated with a probability distribution given by
$\{(n_{i,1}, p_{i,1}), (n_{i,2}, p_{i,2}), \ldots, (n_{i,m_i}, p_{i,m_i})\}$, we will have additional auxiliary states
associated with each such integer $n_{i,j}$ in the support of $A_i$. After the gambler chooses action $A_i$, we
transition from state $s$ to a new random state $s_i$ without changing the counter value. From $s_i$ we move with
probability $p_{i,j}$ to a new state $s_{i,j}$, from which we will deterministically (with probability 1) add
$n_{i,j}$ to the counter, doing the incrementing or decrementing one at a time, by going through $|n_{i,j}|$
additional states $s_{i,j,1}, \ldots, s_{i,j,|n_{i,j}|}$. Finally, after this is done we return to the "base" control
state $s$. It is easy to see that the original solvency game with the objective of minimizing the probability of
bankruptcy is equivalent to the resulting OC-MDP, started in state $s$, with the objective of minimizing the
probability of ever reaching counter value 0 (in any state). Note that since we assume the integers $n_{i,j}$ are
encoded in binary, in principle this reduction yields an OC-MDP that is exponentially larger than the input
solvency game. Of course, to make this a polynomial time reduction we can simply assume that the integers
$n_{i,j}$ are encoded in unary. Nevertheless, we show that even when the $n_{i,j}$'s are encoded in binary, all
qualitative problems for solvency games are decidable in polynomial time:

Proposition 17. Given a solvency game, it is decidable in polynomial time whether the gambler has a
strategy to go bankrupt with probability: $> 0$, $= 1$, $= 0$, or $< 1$.

Proof. The first three cases ($> 0$, $= 1$, $= 0$) are either trivial, or follow fairly easily from what we have
established about OC-MDPs, so we do these first. The last case, $< 1$, is not easy at all, but follows by using
a lovely theorem about non-homogeneous controlled random walks by Durrett, Kesten, and Lawler [8].

> 0: The gambler has a strategy to go bankrupt with probability $> 0$ precisely when there exists an action
$A_i$ such that there is a negative number $n_{i,j} < 0$ in its support (i.e., in the support of the corresponding
finite-support distribution on the integers). If such an action $A_i$ exists, then clearly playing action $A_i$
repeatedly yields a non-zero probability of eventually going bankrupt. If no such action exists, then the
gambler's wealth never decreases and thus the gambler never goes bankrupt, no matter what strategy is used.
= 1: We wish to know whether the gambler has a strategy with which it will go bankrupt with probability 1.
(Never mind that the gambler would be stupid to do this.)

Note that, by the reduction to OC-MDPs described above, this case is equivalent to whether in the
resulting OC-MDP the controller has a strategy to terminate (i.e., hit counter value 0) in any state, with
probability 1. Note that this is the non-selective termination condition (NT). Thus by Theorem 12, if the
supremum probability, over all strategies, of terminating is 1, then there is in fact a counter-oblivious
memoryless (CMD) optimal strategy, $\sigma$, for terminating with probability 1. But note that there is only
one controlled state in the OC-MDP (the state $s$), from which the controller chooses one of the actions
$A_1, \ldots, A_k$. Thus, the CMD strategy $\sigma$ amounts to always choosing the same action $A_i$. Translating
this strategy back to the solvency game, if the supremum probability of bankruptcy is 1, then there is
an optimal action $A_i$ that the gambler should choose repeatedly forever, which achieves bankruptcy
probability $= 1$.

How do we decide which action does this? This is simple: let the drift, $E[A_i]$, associated with an action
$A_i$ be the expected change in the counter value if we take action $A_i$ once. This can clearly be computed
easily in polynomial time from the description of the probability distribution for $A_i$.
Note that once we fix an action $A_i$ that we will choose forever, this basically yields a 1-dimensional
homogeneous random walk on the integers, starting from a positive integer. It then follows from basic
results in the theory of random walks and sums of i.i.d. random variables (see, e.g., [6] Theorem 8.2.5
and Theorem 8.3.4) that, fixing action $A_i$, the resulting random walk (starting with a positive wealth)
will hit wealth 0 (bankruptcy) with probability 1 if and only if both of the following conditions hold: (1)
$E[A_i] \le 0$ (i.e., the drift is not positive), and (2) $A_i$ has some negative value $n_{i,j} < 0$ in its support.
We can of course check these conditions individually for each action $A_i$, and we answer yes precisely if
some action satisfies these conditions.
= 0: Is there a strategy for the gambler to not go bankrupt with probability 1? Clearly, this is the case if and
only if there exists an action $A_i$ which does not have a negative number $n_{i,j} < 0$ in its support. It is
trivial to check this.

< 1: Finally, we come to the most interesting and difficult case: is there a strategy for the gambler to go
bankrupt with probability $< 1$, i.e., to not go bankrupt with positive probability?
Note that:

1. If there exists an action $A_i$ which does not have a negative integer $n_{i,j} < 0$ in its support, then playing
that action repeatedly suffices to not go bankrupt (in fact to not go bankrupt with probability 1).

2. If there exists an action $A_i$ such that $E[A_i] > 0$ (i.e., whose drift is positive), then again by basic facts
about random walks and sums of i.i.d. random variables (again, see, e.g., Theorems 8.2.5 and 8.3.4
of [6]), when that action is played repeatedly starting with any positive wealth, with positive probability
the wealth will never hit 0.

Clearly, both conditions (1.) and (2.) can be checked easily in polynomial time.

Is there any other possible way for the gambler to not go bankrupt with positive probability, perhaps by
using some combination of different actions as its strategy? We shall now see that this is not possible. If
no action satisfies either of the above two conditions, then there is no strategy at all for the gambler to
not go bankrupt with positive probability.

This follows from a lovely (and quite non-trivial to prove) result due to Durrett, Kesten and Lawler [8]
about non-homogeneous controlled random walks (or, as they put it, about when one can and cannot
"make money from fair games"). Specifically, Theorem 1 of [8] says the following: suppose a gambler
gets to choose a sequence $X_1, X_2, X_3, \ldots$ of independent random variables whose range is over the reals,
such that the $X_i$'s, although not necessarily identically distributed, do have the property that they are
only finitely inhomogeneous, meaning that there exists a finite family of probability distributions
$\mathcal{F} = \{F_1, \ldots, F_k\}$ over the reals, such that for all $i \in \mathbb{N}$, the distribution of $X_i$ comes
from the family $\mathcal{F}$. Suppose, furthermore, that every distribution in $\mathcal{F}$ has mean 0, i.e.,
$E[X_i] = 0$ for all $i$, and has finite non-zero variance, i.e., $0 < Var[X_i] < \infty$ for all $i$. Let
$S_n = \sum_{i=1}^{n} X_i$ for $n \in \mathbb{N}$. The gambler's strategy can be adapted, meaning its choice of
distribution for $X_i$ can depend on the outcomes of $X_1, \ldots, X_{i-1}$.
Theorem 1 of [8] says that as long as these conditions hold, the sequence of random variables $S_n$ is
recurrent, meaning there is some $0 < L < \infty$ such that $Prob(S_n \in [-L, L] \text{ i.o.}) = 1$, or in other words,
such that the probability that $S_n \in [-L, L]$ infinitely often (i.e., for infinitely many $n$) is 1.* Note that
this also means that for any fixed value $D < 0$, with probability 1 the sequence $S_n$ will eventually hit a
value $< D$. (This is because it will have infinitely many "shots" at hitting a value $< D$ from a starting
point inside the interval $[-L, L]$, and each such shot has a positive probability which is bounded away
from 0 by a positive $\delta > 0$. This latter fact holds because there are only finitely many distributions to
choose from, and each distribution is non-trivial because it has non-zero variance.)
Let us see now why this implies that the only conditions under which the gambler has a strategy not to
go bankrupt with positive probability are when either one of conditions (1.) or (2.) above holds.
Consider the set of actions $A_1, \ldots, A_k$. Suppose neither condition (1.) nor (2.) holds for any of these
actions. Thus, each action $A_i$ has some negative integer $n_{i,j} < 0$ in its support, and furthermore no action
$A_i$ has positive drift, i.e., for all actions $A_i$, $E[A_i] \le 0$.

Let us first assume that all actions have drift 0, i.e., for all $i$, $E[A_i] = 0$. In this case, since each action
has a negative integer in its support, clearly $Var[A_i] > 0$. Furthermore, since for every $i$ the distribution of
$A_i$ has only finite support, clearly $Var[A_i] < \infty$. Thus we are in exactly the situation of Theorem 1 of [8],
and consequently we know that regardless of what wealth $D$ we start with, with probability 1 the wealth
will eventually hit a value $\le 0$.

What if there are some actions $A_i$ for which $E[A_i] < 0$? Well, intuitively, this can only favor the
probability of bankruptcy. More formally, we can do as follows: for every action $A_i$ with $E[A_i] < 0$,
obtain a new random variable $A'_i$ from $A_i$ by letting $A'_i = A_i - E[A_i]$. Clearly, $E[A'_i] = 0$. Furthermore,
$0 < Var[A'_i] < \infty$, because the same holds for $A_i$. Thus, for these revised random variables, again, the
condition holds that starting from any positive wealth the gambler eventually goes bankrupt with
probability 1, regardless of the strategy. But sums of these revised random variables are always just rightward
translations of sums of the original set of random variables. So if we go bankrupt with probability 1
with the revised random variables, then we would also go bankrupt with probability 1 with the original
random variables. This completes the proof.
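
The translation step can be written out explicitly. In the sketch below, $a_t$ (the action chosen at step $t$),
$Y_t$ (the sampled outcome of $A_{a_t}$), and $Y'_t = Y_t - E[A_{a_t}]$ are notation introduced here for
illustration only; they do not appear in the original argument.

```latex
% a_t : action chosen at step t (illustrative notation)
% Y_t : sampled outcome of A_{a_t};  Y'_t = Y_t - E[A_{a_t}] is the recentred outcome
\[
  S'_n \;=\; \sum_{t=1}^{n} Y'_t
       \;=\; \sum_{t=1}^{n} \bigl( Y_t - E[A_{a_t}] \bigr)
       \;=\; S_n \;-\; \sum_{t=1}^{n} E[A_{a_t}]
       \;\ge\; S_n ,
  \qquad \text{since } E[A_i] \le 0 \text{ for all } i .
\]
% Hence, for any initial wealth D > 0:  D + S'_n \le 0  implies  D + S_n \le 0 .
```

So every run that is bankrupt under the recentred variables is also bankrupt under the original ones.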

Thus checking cases (1.) and (2.) for each action yields a correct polynomial time algorithm for
determining whether there is a strategy for the gambler to not go bankrupt with positive probability.

□ 
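
The four tests used in the proof of Proposition 17 can be summarized in a short Python sketch. It is an
illustration only: the representation of an action as a list of pairs $(n_{i,j}, p_{i,j})$ with exact rational
probabilities, and the names `drift`, `has_negative` and `qualitative_bankruptcy`, are assumptions made here.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

# An action is a finite-support distribution: pairs (integer outcome, probability).
Action = List[Tuple[int, Fraction]]


def drift(action: Action) -> Fraction:
    """Expected change of the wealth when the action is played once."""
    return sum((n * p for n, p in action), Fraction(0))


def has_negative(action: Action) -> bool:
    """Does the support of the action contain a negative integer?"""
    return any(n < 0 for n, _ in action)


def qualitative_bankruptcy(actions: List[Action]) -> Dict[str, bool]:
    """For each case, decide whether some strategy goes bankrupt with that probability."""
    return {
        "> 0": any(has_negative(a) for a in actions),
        "= 1": any(has_negative(a) and drift(a) <= 0 for a in actions),
        "= 0": any(not has_negative(a) for a in actions),
        "< 1": any(not has_negative(a) or drift(a) > 0 for a in actions),
    }
```

Each test inspects the actions one at a time and uses only exact rational arithmetic on the input, so the binary
encoding of the integers $n_{i,j}$ causes no blow-up. The initial wealth plays no role in the qualitative answers.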



C Why bounding the counter can yield bad approximations 

As discussed in the introduction, here is a simple example for why cutting off the counter at a finite value,
even for a purely stochastic QBD (equivalently, a probabilistic one-counter automaton), can in general
radically alter its behavior. Consider a 2-state QBD which in state 1, with probability $p = 1/2^n$ goes to state 2,
and with probability $1 - p$ stays in state 1, in both cases incrementing the counter, and in state 2 stays in
state 2 with probability 1 and decrements the counter. We are interested in the probability of termination
starting at state 1, with counter value 1. By cutting off the counter at a value $N \in 2^{o(n)}$, the termination
probability goes down from 1 to a value $\epsilon$ arbitrarily close to 0, for large enough $n$. Although we used small
probabilities $1/2^n$ in this example, the same thing can easily be achieved using a QBD with $O(n)$ states and only
the probability $1/2$ on transitions.

* Incidentally, in [8] they also note that without the condition that $Var[X_i] < \infty$, there are simple examples
where $S_n \to \infty$ almost surely. In other words, without such conditions on higher moments, one can indeed make
money from fair games.
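
The effect of the cut-off in the two-state example can be checked numerically. The sketch below rests on an
assumption made here about what "cutting off" means: a run whose counter would exceed the bound $N$ is treated as
non-terminating. Under that assumption at most $N - 1$ steps can be spent in state 1, and the run terminates
exactly when one of them moves to state 2. The function name is hypothetical.

```python
from fractions import Fraction


def truncated_termination_probability(n: int, cutoff: int) -> Fraction:
    """Termination probability from state 1 with counter value 1 when the counter is capped.

    Assumes that runs whose counter would exceed `cutoff` count as non-terminating.
    """
    p = Fraction(1, 2 ** n)  # probability of moving from state 1 to state 2
    # The counter starts at 1 and grows by one for every step taken from state 1.
    steps_available = max(cutoff - 1, 0)
    # The run terminates iff at least one of the available steps switches to state 2.
    return 1 - (1 - p) ** steps_available
```

By the union bound the returned value is at most $(N - 1)/2^n$, so it tends to 0 whenever $N$ grows more slowly than
$2^n$, while the untruncated process terminates with probability 1.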